Grounded Compositional Semantics for Finding and Describing Images with Sentences

Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.


Introduction
Single word vector spaces are widely used (Turney and Pantel, 2010) and successful at classifying single words and capturing their meaning (Collobert and Weston, 2008; Huang et al., 2012). Since words rarely appear in isolation, the task of learning compositional meaning representations for longer phrases has recently received a lot of attention (Mitchell and Lapata, 2010; Grefenstette et al., 2013). Similarly, classifying whole images into a fixed set of classes also achieves very high performance (Krizhevsky et al., 2012). However, similar to words, objects in images are often seen in relationships with other objects which are not adequately described by a single label.
In this work, we introduce a model, illustrated in Fig. 1, which learns to map sentences and images into a common embedding space in order to be able to retrieve one from the other. We assume word and image representations are first learned in their respective single modalities but finally mapped into a jointly learned multimodal embedding space.
Our model for mapping sentences into this space is based on ideas from Recursive Neural Networks (RNNs) (Pollack, 1990;Costa et al., 2003;Socher et al., 2011b). However, unlike all previous RNN models which are based on constituency trees (CT-RNNs), our model computes compositional vector representations inside dependency trees. The compositional vectors computed by this new dependency tree RNN (DT-RNN) capture more of the meaning of sentences, where we define meaning in terms of similarity to a "visual representation" of the textual description. DT-RNN induced vector representations of sentences are more robust to changes in the syntactic structure or word order than related models such as CT-RNNs or Recurrent Neural Networks since they naturally focus on a sentence's action and its agents.
We evaluate and compare DT-RNN induced representations on their ability to use a sentence such as "A man wearing a helmet jumps on his bike near a beach." to find images that show such a scene. The goal is to learn sentence representations that capture the visual scene described and to find appropriate images in the learned, multi-modal sentence-image space. Conversely, when given a query image, we would like to find a description that goes beyond a single label by providing a correct sentence describing it, a task that has recently garnered a lot of attention (Farhadi et al., 2010; Ordonez et al., 2011; Kuznetsova et al., 2012). We use the dataset introduced by Rashtchian et al. (2010), which consists of 1000 images, each with 5 descriptions. On all tasks, our model outperforms baselines and related models.
Figure 1: The DT-RNN learns vector representations for sentences based on their dependency trees. We learn to map the outputs of convolutional neural networks applied to images into the same space and can then compare both sentences and images. This allows us to query images with a sentence and give sentence descriptions to images. Panel titles: Compositional Sentence Vectors; Image Vector Representation. Example sentences shown: "A man wearing a helmet jumps on his bike near a beach."; "Two airplanes parked in an airport."; "A man jumping his downhill bike."; "A small child sits on a cement wall near white flower."

Related Work
The presented model is connected to several areas of NLP and vision research, each with a large amount of related work to which we can only do some justice given space constraints.
Semantic Vector Spaces and Their Compositionality.
The dominant approach in semantic vector spaces uses distributional similarities of single words. Often, co-occurrence statistics of a word and its context, such as tf-idf, are used to describe each word (Turney and Pantel, 2010; Baroni and Lenci, 2010). Most of the compositionality algorithms and related datasets capture two-word compositions. For instance, Mitchell and Lapata (2010) use two-word phrases and analyze similarities computed by vector addition, multiplication and others. Compositionality is an active field of research with many different models and representations being explored (Grefenstette et al., 2013), among many others. We compare to supervised compositional models that can learn task-specific vector representations such as constituency tree recursive neural networks (Socher et al., 2011b; Socher et al., 2011a), chain-structured recurrent neural networks and other baselines. Another alternative would be to use CCG trees as a backbone for vector composition (K.M. Hermann, 2013).

Multimodal Embeddings.
Multimodal embedding methods project data from multiple sources such as sound and video (Ngiam et al., 2011) or images and text. Socher and Fei-Fei (2010) project words and image regions into a common space using kernelized canonical correlation analysis to obtain state-of-the-art performance in annotation and segmentation. Similar to our work, they use unsupervised large text corpora to learn semantic word representations. Among other recent work is that by Srivastava and Salakhutdinov (2012), who developed multimodal Deep Boltzmann Machines. Similar to their work, we use techniques from the broad field of deep learning to represent images and words.
Recently, single word vector embeddings have been used for zero shot learning (Socher et al., 2013c). Mapping images to word vectors enabled their system to classify images as depicting objects such as "cat" without seeing any examples of this class. Related work has also been presented at NIPS (Socher et al., 2013b;Frome et al., 2013). This work moves zero-shot learning beyond single categories per image and extends it to unseen phrases and full length sentences, making use of similar ideas of semantic spaces grounded in visual knowledge.

Detailed Image Annotation.
The interaction between images and text is a growing research field. Early work in this area includes generating single words or fixed phrases from images (Duygulu et al., 2002; Barnard et al., 2003) or using contextual information to improve recognition (Gupta and Davis, 2008; Torralba et al., 2010).
Apart from a large body of work on single object image classification, there is also work on attribute classification and other mid-level elements (Kumar et al., 2009), some of which we hope to capture with our approach as well.
Our work is close in spirit to recent work on describing images with more detailed, longer textual descriptions. In particular, Yao et al. (2010) describe images using hierarchical knowledge and humans in the loop. In contrast, our work does not require human interaction. Farhadi et al. (2010) and , on the other hand, use a more automatic method to parse images. For instance, the former approach uses a single triple of objects estimated for an image to retrieve sentences from a collection written to describe similar images. It forms representations to describe 1 object, 1 action, and 1 scene.  extends their method to describe an image with multiple objects. None of these approaches has used a compositional sentence vector representation, and they require specific language generation techniques and sophisticated inference methods. Since our model is based on neural networks, inference is fast and simple. Kuznetsova et al. (2012) use a very large parallel corpus to connect images and sentences. Feng and Lapata (2013) use a large dataset of captioned images and experiment with both extractive (search) and abstractive (generation) models.
Most related is the very recent work of Hodosh et al. (2013). They too evaluate using a ranking measure. In our experiments, we compare to kernelized Canonical Correlation Analysis which is the main technique in their experiments.

Dependency-Tree Recursive Neural Networks
In this section we first focus on the DT-RNN model that computes compositional vector representations for phrases and sentences of variable length and syntactic type. In Section 5 the resulting vectors will then become multimodal features by mapping images that show what the sentence describes to the same space and learning both the image and sentence mapping jointly. The most common way of building representations for longer phrases from single word vectors is to simply linearly average the word vectors. While this bag-of-words approach can yield reasonable performance in some tasks, it gives all the words the same weight and cannot distinguish important differences in simple visual descriptions such as "The bike crashed into the standing car." vs. "The car crashed into the standing bike."
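The limitation of averaging can be checked directly. The following is a minimal sketch, assuming randomly drawn 50-dimensional vectors as stand-ins for learned word vectors; the `vocab` table and the `average_embedding` helper are hypothetical names introduced only for illustration. Because the mean ignores word order, the two crash sentences receive the same representation.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "bike", "crashed", "into", "standing", "car"]
# Hypothetical stand-ins for learned word vectors (d = 50).
vocab = {w: rng.standard_normal(50) for w in words}

def average_embedding(sentence):
    """Bag-of-words sentence vector: the unweighted mean of the word vectors."""
    return np.mean([vocab[w] for w in sentence.lower().split()], axis=0)

a = average_embedding("The bike crashed into the standing car")
b = average_embedding("The car crashed into the standing bike")
same = np.allclose(a, b)  # True: swapping "bike" and "car" changes nothing
```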
RNN models (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2011b; Socher et al., 2011a) provided a novel way of combining word vectors for longer phrases that moved beyond simple averaging. They combine vectors with an RNN in binary constituency trees which have potentially many hidden layers. While the induced vector representations work very well on many tasks, they also inevitably capture a lot of syntactic structure of the sentence. However, the task of finding images from sentence descriptions requires us to be more invariant to syntactic differences. One such example is the active-passive construction, which can collapse words such as "by" in some formalisms (de Marneffe et al., 2006), relying instead on the semantic relationship of "agent". For instance, "The mother hugged her child." and "The child was hugged by its mother." should map to roughly the same visual space. Current Recursive and Recurrent Neural Networks do not exhibit this behavior and even bag-of-words representations would be influenced by the words "was" and "by". The model we describe below focuses more on recognizing actions and agents and has the potential to learn representations that are invariant to active-passive differences.

DT-RNN Inputs: Word Vectors and Dependency Trees
In order for the DT-RNN to compute a vector representation for an ordered list of $m$ words (a phrase or sentence), we map the single words to a vector space and then parse the sentence. First, we map each word to a $d$-dimensional vector. We initialize these word vectors with the unsupervised model of Huang et al. (2012) which can learn single word vector representations from both local and global contexts. The idea is to construct a neural network that outputs high scores for windows and documents that occur in a large unlabeled corpus and low scores for window-document pairs where one word is replaced by a random word. When such a network is optimized via gradient descent the derivatives backpropagate into a word embedding matrix $X$ which stores word vectors as columns. In order to predict correct scores the vectors in the matrix capture co-occurrence statistics. We use $d = 50$ in all our experiments. The embedding matrix $X$ is then used by finding the column index $i$ of each word, $[w] = i$, and retrieving the corresponding column $x_w$ from $X$. Henceforth, we represent an input sentence $s$ as an ordered list of (word, vector) pairs: $s = ((w_1, x_{w_1}), \ldots, (w_m, x_{w_m}))$.
Figure 2: Example of a full dependency tree for a longer sentence. The DT-RNN will compute a vector representation at every word that represents that word and an arbitrary number of child nodes. The final representation is computed at the root node, here at the verb "jumps". Note that more important activity and object words are higher up in this tree structure.
Next, the sequence of words $(w_1, \ldots, w_m)$ is parsed by the dependency parser of de Marneffe et al. (2006). Fig. 2 shows an example. We can represent a dependency tree $d$ of a sentence $s$ as an ordered list of (child, parent) indices: $d(s) = \{(i, j)\}$, where every child word in the sequence $i = 1, \ldots, m$ is present and has any word $j \in \{1, \ldots, m\} \cup \{0\}$ as its parent. The root word has 0 as its parent, and the same word can be a parent between zero and $m$ times. Without loss of generality, we assume that these indices form a tree structure. To summarize, the input to the DT-RNN for each sentence is the pair $(s, d)$: the words and their vectors and the dependency tree.
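The (child, parent) list can be turned into per-node child lists in a few lines. This is a minimal sketch: the helper name `children_by_side` is hypothetical, and ordering children from nearest to farthest on each side is an assumption chosen to match the position-dependent matrices used later. The edge list encodes the tree shape of the five-word example discussed in this section (root at word 2).

```python
from collections import defaultdict

def children_by_side(edges):
    """Group the children of each parent into left and right lists, nearest first.

    `edges` is a list of (child, parent) pairs with 1-based word positions;
    the root has parent 0.
    """
    left, right = defaultdict(list), defaultdict(list)
    root = None
    for child, parent in edges:
        if parent == 0:
            root = child
        elif child < parent:
            left[parent].append(child)
        else:
            right[parent].append(child)
    for p in left:
        left[p].sort(reverse=True)   # closest left child first
    for p in right:
        right[p].sort()              # closest right child first
    return root, dict(left), dict(right)

# Words 1, 3 and 4 attach to word 2; word 5 attaches to word 4.
edges = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]
root, left, right = children_by_side(edges)
```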
The DT-RNN model will compute parent vectors at each word that include all the dependent (children) nodes in a bottom-up fashion using a compositionality function $g_\theta$ which is parameterized by all the model parameters $\theta$. To this end, the algorithm searches for nodes in a tree that either (i) have no children or (ii) have children whose vectors have already been computed, and then computes the corresponding vector.
In our example, the words $x_1$, $x_3$, $x_5$ are leaf nodes and hence we can compute their corresponding hidden nodes via
$$h_c = g_\theta(x_c) = f(W_v x_c) \quad \text{for } c = 1, 3, 5, \tag{1}$$
where we compute the hidden vector at position $c$ via our general composition function $g_\theta$. In the case of leaf nodes, this composition function becomes simply a linear layer, parameterized by $W_v \in \mathbb{R}^{n \times d}$, followed by a nonlinearity. We cross-validate over using no nonlinearity ($f = \mathrm{id}$), tanh, sigmoid or rectified linear units ($f = \max(0, x)$), but generally find tanh to perform best.
The final sentence representation we want to compute is at $h_2$; however, since we still do not have $h_4$, we compute that one next:
$$h_4 = g_\theta(x_4, h_5) = f(W_v x_4 + W_{r1} h_5), \tag{2}$$
where we use the same $W_v$ as before to map the word vector into hidden space but we now also have a linear layer that takes as input $h_5$, the only child of the fourth node. The matrix $W_{r1} \in \mathbb{R}^{n \times n}$ is used because node 5 is the first child node on the right side of node 4. Generally, we have multiple matrices for composing with hidden child vectors from the right and left sides: $W_{r\cdot} = (W_{r1}, \ldots, W_{rk_r})$ and $W_{l\cdot} = (W_{l1}, \ldots, W_{lk_l})$. The number of needed matrices is determined by the data by simply finding the maximum numbers of left ($k_l$) and right ($k_r$) children any node has. If at test time a child appeared at an even larger distance (this does not happen in our test set), the corresponding matrix would be the identity matrix.
Now that all children of $h_2$ have their hidden vectors, we can compute the final sentence representation via
$$h_2 = g_\theta(x_2, h_1, h_3, h_4) = f(W_v x_2 + W_{l1} h_1 + W_{r1} h_3 + W_{r2} h_4). \tag{3}$$
Notice that the children are multiplied by matrices that depend on their location relative to the current node. Another modification that improves the mean rank by approximately 6 in image search on the dev set is to weight nodes by the number of words underneath them and normalize by the sum of words under all children. This encourages the intuitive desideratum that nodes describing longer phrases are more important. Let $\ell(i)$ be the number of leaf nodes (words) under node $i$ and $C(i, y)$ be the set of child nodes of node $i$ in dependency tree $y$. The final composition function for a node vector $h_i$ becomes
$$h_i = f\left(\frac{1}{\ell(i)}\left(W_v x_i + \sum_{j \in C(i)} \ell(j)\, W_{\mathrm{pos}(i,j)}\, h_j\right)\right), \tag{4}$$
where by definition $\ell(i) = 1 + \sum_{j \in C(i)} \ell(j)$ and $\mathrm{pos}(i, j)$ is the relative position of child $j$ with respect to node $i$, e.g. $l1$ or $r2$ in Eq. 3.
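The bottom-up computation with word-count weighting can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `dt_rnn_forward`, the dictionary-based tree encoding, and the random parameters are all hypothetical, and the weighting follows the definition $\ell(i) = 1 + \sum_j \ell(j)$ given above. A child farther away than any available matrix falls back to the identity, as the text describes.

```python
import numpy as np

def dt_rnn_forward(x, parent, W_v, W_left, W_right, f=np.tanh):
    """Compute hidden vectors for every node of a dependency tree, bottom up.

    x       : dict position -> word vector of shape (d,)
    parent  : dict position -> parent position (0 marks the root)
    W_v     : (n, d) matrix mapping word vectors into hidden space
    W_left  : list of (n, n) matrices; W_left[k] is used for the (k+1)-th left child
    W_right : same for right children
    """
    children = {i: [] for i in x}
    root = None
    for c, p in parent.items():
        if p == 0:
            root = c
        else:
            children[p].append(c)

    h, ell = {}, {}

    def visit(i):
        left = sorted((c for c in children[i] if c < i), reverse=True)
        right = sorted(c for c in children[i] if c > i)
        total = W_v @ x[i]
        ell[i] = 1
        for side, mats in ((left, W_left), (right, W_right)):
            for k, j in enumerate(side):
                visit(j)  # children first, so h[j] and ell[j] exist
                W = mats[k] if k < len(mats) else np.eye(len(total))
                total = total + ell[j] * (W @ h[j])
                ell[i] += ell[j]
        h[i] = f(total / ell[i])

    visit(root)
    return h, ell, root

rng = np.random.default_rng(0)
d = n = 50
x = {i: rng.standard_normal(d) for i in range(1, 6)}
parent = {1: 2, 2: 0, 3: 2, 4: 2, 5: 4}   # tree shape of the five-word example
W_v = 0.1 * rng.standard_normal((n, d))
W_left = [0.1 * rng.standard_normal((n, n))]
W_right = [0.1 * rng.standard_normal((n, n)) for _ in range(2)]
h, ell, root = dt_rnn_forward(x, parent, W_v, W_left, W_right)
sentence_vector = h[root]
```

The recursion visits leaves first, so each hidden vector is available by the time its parent needs it; the vector at the root serves as the sentence representation.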

Semantic Dependency Tree RNNs
An alternative is to condition the weight matrices on the semantic relations given by the dependency parser. We use the collapsed tree formalism of the Stanford dependency parser (de Marneffe et al., 2006). With such a semantic untying of the weights, the DT-RNN makes better use of the dependency formalism and could give active-passive reversals similar semantic vector representations. The equation for this semantic DT-RNN (SDT-RNN) is the same as the one above except that the matrices $W_{\mathrm{pos}(i,j)}$ are replaced with matrices based on the dependency relationship. There are a total of 141 unique such relationships in the dataset. However, most are very rare. For examples of semantic relationships, see Fig. 2 and the model analysis in Section 6.7.
This forward propagation can be used for computing compositional vectors, and in Sec. 5 we will explain the objective function with which these are trained.

Comparison to Previous RNN Models
The DT-RNN has several important differences to the previous RNN models of Socher et al. (2011a) and (Socher et al., 2011b; Socher et al., 2011c). These constituency tree RNNs (CT-RNNs) use the following composition function to compute a hidden parent vector $h$ from exactly two child vectors $(c_1, c_2)$ in a binary tree: $h = f(W [c_1; c_2])$, where $W \in \mathbb{R}^{d \times 2d}$ is the main parameter to learn. This can be rewritten to show the similarity to the DT-RNN as $h = f(W_{l1} c_1 + W_{r1} c_2)$. However, there are several important differences. Note first that in previous RNN models the parent vectors were of the same dimensionality to be recursively compatible and be used as input to the next composition. In contrast, our new model first maps single words into a hidden space and then parent nodes are composed from these hidden vectors. This allows a higher capacity representation which is especially helpful for nodes that have many children.
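The rewriting of the CT-RNN composition can be verified numerically. The sketch below assumes the CT-RNN parameter is a single $d \times 2d$ matrix applied to the stacked children, and uses random values and tanh purely for illustration; splitting the matrix into its left and right column blocks gives the same parent vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
W = rng.standard_normal((d, 2 * d))      # single CT-RNN parameter matrix
c1, c2 = rng.standard_normal(d), rng.standard_normal(d)

# Constituency-tree form: one matrix applied to the stacked children.
h_stacked = np.tanh(W @ np.concatenate([c1, c2]))

# Split form: the left and right column blocks act on each child separately.
W_l1, W_r1 = W[:, :d], W[:, d:]
h_split = np.tanh(W_l1 @ c1 + W_r1 @ c2)

equivalent = np.allclose(h_stacked, h_split)
```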
Secondly, the DT-RNN allows for n-ary nodes in the tree. This is an improvement that is possible even for constituency tree CT-RNNs but it has not been explored in previous models.
Third, due to computing parent nodes in constituency trees, previous models had the problem that words that are merged last in the tree have a larger weight or importance in the final sentence representation. This can be problematic since these are often simple non-content words, such as a leading 'But,'. While such single words can be important for tasks such as sentiment analysis, we argue that for describing visual scenes the DT-RNN captures the more important effects: the dependency tree structures push the central content words such as the main action or verb and its subject and object to be merged last and hence, by construction, the final sentence representation is more robust to less important adjectival modifiers, word order changes, etc.
Fourth, we allow some untying of weights depending on either how far away a constituent is from the current word or what its semantic relationship is. Now that we can compute compositional vector representations for sentences, the next section describes how we represent images.

Learning Image Representations with Neural Networks
The image features that we use in our experiments are extracted from a deep neural network, replicated from the one described in . The network was trained using both unlabeled data (random web images) and labeled data to classify 22,000 categories in ImageNet (Deng et al., 2009). We then used the features at the last layer, before the classifier, as the feature representation in our experiments. The dimension of the feature vector of the last layer is 4,096. The details of the model and its training procedures are as follows.
The architecture of the network can be seen in Figure 4. The network takes 200x200 pixel images as inputs and has 9 layers. The layers consist of three sequences of filtering, pooling and local contrast normalization (Jarrett et al., 2009). The pooling function is L2 pooling of the previous layer (taking the square of the filtering units, summing them up in a small area in the image, and taking the square root). The local contrast normalization takes inputs in a small area of the lower layer, subtracts the mean and divides by the standard deviation.
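The two operations can be written compactly. This is a simplified sketch: non-overlapping pooling windows, normalization over a single whole patch, and the small `eps` constant are assumptions made for brevity, since the text does not specify window sizes or strides. The function names are hypothetical.

```python
import numpy as np

def l2_pool(responses, size):
    """L2 pooling over non-overlapping size x size windows of a 2-D map."""
    h, w = responses.shape
    h, w = h - h % size, w - w % size            # drop any ragged border
    blocks = responses[:h, :w].reshape(h // size, size, w // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

def local_contrast_normalize(patch, eps=1e-8):
    """Subtract the mean of a patch and divide by its standard deviation."""
    return (patch - patch.mean()) / (patch.std() + eps)

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = l2_pool(fmap, 2)                        # shape (2, 2)
normalized = local_contrast_normalize(fmap)      # zero mean, unit std
```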
The network was first trained using an unsupervised objective: trying to reconstruct the input while keeping the neurons sparse. In this phase, the network was trained on 20 million images randomly sampled from the web. We resized a given image so that its short dimension has 200 pixels. We then cropped a fixed size 200x200 pixel image right at the center of the resized image. This means we may discard a fraction of the long dimension of the image.
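The resize-and-crop geometry is simple arithmetic. The helper below is a hypothetical sketch that only computes sizes and the crop box, leaving the actual resampling to an image library; rounding the resized long side to the nearest pixel is an assumption.

```python
def resize_and_center_crop_box(width, height, target=200):
    """Return the resized (width, height) and the centered crop box.

    The short side is scaled to `target` pixels; the box is
    (left, top, right, bottom) in resized-image coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

size, box = resize_and_center_crop_box(640, 480)
# size == (267, 200); box == (33, 0, 233, 200): 67 columns of the long side are discarded
```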
After unsupervised training, we used ImageNet (Deng et al., 2009) to adjust the features in the entire network. The ImageNet dataset has 22,000 categories and 14 million images. The number of images in each category is equal across categories. The 22,000 categories are extracted from WordNet.
To speed up the supervised training of this network, we made a simple modification to the algorithm described in : adding a "bottleneck" layer in between the last layer and the classifier to reduce the number of connections. We added one "bottleneck" layer which has 4,096 units in between the last layer of the network and the softmax layer. This newly-added layer is fully connected to the previous layer and has a linear activation function. The total number of connections of this network is approximately 1.36 billion.
The network was trained again using the supervised objective of classifying the 22,000 classes in ImageNet. Most features in the networks are local, which allows model parallelism. Data parallelism by asynchronous SGD was also employed as in . The entire training, both unsupervised and supervised, took 8 days on a large cluster of machines. This network achieves 18.3% precision@1 on the full ImageNet dataset (Release Fall 2011).
We will use the features at the bottleneck layer as the feature vector z of an image. Each scaled and cropped image is presented to our network. The network then performs a feedforward computation to compute the values of the bottleneck layer. This means that every image is represented by a fixed length vector of 4,096 dimensions. Note that during training, no aligned sentence-image data was used and the ImageNet classes do not fully intersect with the words used in our dataset.

Multimodal Mappings
The previous two sections described how we can map sentences into a $d = 50$-dimensional space and how to extract high quality image feature vectors of 4096 dimensions. We now define our final multimodal objective function for learning joint image-sentence representations with these models. Our training set consists of $N$ images and their feature vectors $z_i$, and each image has 5 sentence descriptions $s_{i1}, \ldots, s_{i5}$ for which we use the DT-RNN to compute vector representations. See Fig. 5 for examples from the dataset. For training, we use a max-margin objective function which intuitively trains pairs of correct image and sentence vectors to have high inner products and incorrect pairs to have low inner products. Let $v_i = W_I z_i$ be the mapped image vector and $y_{ij} = \text{DT-RNN}_\theta(s_{ij})$ the composed sentence vector. We define $S$ to be the set of all sentence indices and $S(i)$ the set of sentence indices corresponding to image $i$. Similarly, $I$ is the set of all image indices and $I(j)$ is the image index of sentence $j$. The set $P$ is the set of all correct image-sentence training pairs $(i, j)$. The ranking cost function to minimize is then
$$J(W_I, \theta) = \sum_{(i,j) \in P} \sum_{c \in S \setminus S(i)} \max(0, \Delta - v_i^T y_j + v_i^T y_c) + \sum_{(i,j) \in P} \sum_{c \in I \setminus I(j)} \max(0, \Delta - v_i^T y_j + v_c^T y_j), \tag{5}$$
where $\theta$ are the language composition matrices, and both second sums are over other sentences coming from different images and vice versa. The hyperparameter $\Delta$ is the margin. The margin is found via cross validation on the dev set and is usually around 1.
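The max-margin ranking cost described above can be sketched in NumPy as follows. The function name and array layout are assumptions for illustration: rows of `V` hold mapped image vectors, rows of `Y` hold composed sentence vectors, and `image_of_sentence[j]` plays the role of $I(j)$. Regularization is omitted.

```python
import numpy as np

def ranking_loss(V, Y, image_of_sentence, margin=1.0):
    """Max-margin ranking cost summed over all correct image-sentence pairs.

    V : (N, n) mapped image vectors v_i
    Y : (M, n) composed sentence vectors y_j
    image_of_sentence : length-M integer sequence; entry j is the image index I(j)
    """
    scores = V @ Y.T                                  # scores[i, j] = v_i^T y_j
    image_of_sentence = np.asarray(image_of_sentence)
    loss = 0.0
    for j, i in enumerate(image_of_sentence):
        correct = scores[i, j]
        # Contrast sentences: all sentences that describe a different image.
        other_sents = image_of_sentence != i
        loss += np.maximum(0.0, margin - correct + scores[i, other_sents]).sum()
        # Contrast images: all images other than the correct one.
        other_imgs = np.arange(V.shape[0]) != i
        loss += np.maximum(0.0, margin - correct + scores[other_imgs, j]).sum()
    return loss

V = np.eye(2)                     # two toy image vectors
Y = np.eye(2)                     # two toy sentence vectors
aligned = ranking_loss(V, Y, [0, 1])      # correct pairs beat wrong ones by the margin
swapped = ranking_loss(V, Y, [1, 0])      # wrong pairs score higher, so the loss grows
```

In the toy example every correct pair has inner product 1 and every incorrect pair 0, so with a margin of 1 the aligned loss is zero, while swapping the pairing incurs a penalty on all four contrast terms.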
The final objective also includes the regularization term $\lambda \left( \|\theta\|_2^2 + \|W_I\|_F \right)$. Both the visual model and the word vector learning require a very large amount of training data and both have a huge number of parameters. Hence, to prevent overfitting, we assume their weights are fixed and only train the DT-RNN parameters $\theta$ and the image mapping $W_I$. If larger training corpora become available in the future, training both jointly becomes feasible and would present a very promising direction. We use a modified version of AdaGrad (Duchi et al., 2011) for optimization of both $W_I$ and the DT-RNN as well as the other baselines (except kCCA). AdaGrad has achieved good performance previously in neural network models (Socher et al., 2013a). We modify it by resetting all squared gradient sums to 1 every 5 epochs. With both images and sentences in the same multimodal space, we can easily query the model for similar images or sentences by finding the nearest neighbors in terms of negative inner products.
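The modified AdaGrad update can be sketched as below. The class name is hypothetical, and initializing the accumulator at 1 (so that no extra epsilon is needed) is an assumption; the text only states that the squared gradient sums are reset to 1 every 5 epochs.

```python
import numpy as np

class AdaGradWithReset:
    """AdaGrad whose squared-gradient sums are reset to 1 every few epochs."""

    def __init__(self, shape, learning_rate=1e-4, reset_every=5):
        self.learning_rate = learning_rate
        self.reset_every = reset_every
        self.sq_sum = np.ones(shape)   # assumed initial value of the accumulator

    def step(self, params, grad):
        """Accumulate squared gradients and return the updated parameters."""
        self.sq_sum += grad ** 2
        return params - self.learning_rate * grad / np.sqrt(self.sq_sum)

    def end_epoch(self, epoch):
        """Call after each epoch (1-based); resets the sums periodically."""
        if epoch % self.reset_every == 0:
            self.sq_sum[:] = 1.0

opt = AdaGradWithReset(shape=(3,), learning_rate=0.1)
w = opt.step(np.zeros(3), np.array([1.0, 2.0, 0.0]))
```

Resetting the accumulator keeps the effective step size from shrinking indefinitely over long training runs.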
An alternative objective function is based on the squared loss $J(W_I, \theta) = \sum_{(i,j) \in P} \|v_i - y_j\|_2^2$. This requires an alternating minimization scheme that first trains only $W_I$, then fixes $W_I$ and trains the DT-RNN weights $\theta$, and then repeats this several times. We find that the performance with this objective function (paired with finding similar images using Euclidean distances) is worse for all models than the margin loss of Eq. 5. In addition, kCCA also performs much better using inner products in the multimodal space.

Experiments
We use the dataset of Rashtchian et al. (2010) which consists of 1000 images, each with 5 sentences. See Fig. 5 for examples.
We evaluate and compare the DT-RNN in three different experiments. First, we analyze how well the sentence vectors capture similarity in visual meaning. Then we analyze Image Search with Query Sentences: to query each model with a sentence in order to find an image showing that sentence's visual 'meaning.' The last experiment, Describing Images by Finding Suitable Sentences, does the reverse search where we query the model with an image and try to find the closest textual description in the embedding space.
Figure 5: Examples from the dataset of images and their sentence descriptions. Sentence length varies greatly and different objects can be mentioned first. Hence, models have to be invariant to word ordering.
In our comparison to other methods we focus on those models that can also compute fixed, continuous vectors for sentences. In particular, we compare to the RNN model on constituency trees of Socher et al. (2011a), a standard recurrent neural network, and a simple bag-of-words baseline which averages the words. All models use the word vectors provided by Huang et al. (2012) and do not update them as discussed above. Models are trained with their corresponding gradients and backpropagation techniques. A standard recurrent model is used where the hidden vector at word index $t$ is computed from the hidden vector at the previous time step and the current word vector: $h_t = f(W_h h_{t-1} + W_x x_t)$. During training, we take the last hidden vector of the sentence chain and propagate the error into that. It is also this vector that is used to represent the sentence.
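The recurrent baseline's forward pass amounts to a short loop. In this sketch the zero initial hidden state and the tanh nonlinearity are assumptions, and the function name is hypothetical.

```python
import numpy as np

def recurrent_sentence_vector(word_vectors, W_h, W_x, f=np.tanh):
    """Run a chain-structured recurrent network; return the last hidden vector."""
    h = np.zeros(W_h.shape[0])        # assumed zero initial state
    for x_t in word_vectors:
        h = f(W_h @ h + W_x @ x_t)
    return h

rng = np.random.default_rng(0)
words = [rng.standard_normal(50) for _ in range(7)]   # stand-in word vectors
W_h = 0.1 * rng.standard_normal((50, 50))
W_x = 0.1 * rng.standard_normal((50, 50))
sentence_vector = recurrent_sentence_vector(words, W_h, W_x)
```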
Other possible comparisons are to the very different models mentioned in the related work section. These models use a lot more task-specific engineering, such as running object detectors with bounding boxes, attribute classifiers, scene classifiers, CRFs for composing the sentences, etc. Another line of work uses large sentence-image aligned resources (Kuznetsova et al., 2012), whereas we focus on easily obtainable training data of each modality separately and a rather small multimodal corpus.
In our experiments we split the data into 800 training, 100 development and 100 test images. Since there are 5 sentences describing each image, we have 4000 training sentences and 500 testing sentences. The dataset has 3020 unique words, half of which only appear once. Hence, the unsupervised, pre-trained semantic word vector representations are crucial. Word vectors are not fine-tuned during training. Hence, the main parameters are the DT-RNN's $W_{l\cdot}$, $W_{r\cdot}$ or the semantic matrices, of which there are 141, and the image mapping $W_I$. For both DT-RNNs the weight matrices are initialized to block identity matrices plus Gaussian noise. Word vectors and hidden vectors are set to length 50. Using the development split, we found $\lambda = 0.08$ and an AdaGrad learning rate of 0.0001 to work best. The best model uses a margin of $\Delta = 3$.
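The initialization can be sketched as follows. The noise scale of 0.01 and the reading of "block identity" as an identity block padded with zeros for non-square shapes are assumptions; the text does not give either detail.

```python
import numpy as np

def identity_plus_noise(rows, cols, noise_std=0.01, seed=0):
    """Identity block (zero-padded if not square) plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    return np.eye(rows, cols) + noise_std * rng.standard_normal((rows, cols))

W_r1 = identity_plus_noise(50, 50)   # one position-specific composition matrix
```

Starting near the identity means that each parent initially behaves like a weighted average of its own word vector and its children's hidden vectors.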
Inspired by Socher and Fei-Fei (2010) and Hodosh et al. (2013) we also compare to kernelized Canonical Correlation Analysis (kCCA). We use the average of word vectors for describing sentences and the same powerful image vectors as before. We use the code of Socher and Fei-Fei (2010). Technically, one could combine the recently introduced deep CCA of Andrew et al. (2013) and train the recursive neural network architectures with the CCA objective. We leave this to future work. With linear kernels, kCCA does well for image search but is worse for sentence self-similarity and for describing images with sentences close by in embedding space. All other models are trained by replacing the DT-RNN function in Eq. 5.

Similarity of Sentences Describing the Same Image
In this experiment, we first map all 500 sentences from the test set into the multi-modal space. Then for each sentence, we find the nearest neighbor sentences in terms of inner products. We then sort these neighbors and record the rank or position of the nearest sentence that describes the same image. If all the images were very unique and the visual descriptions were consistent close paraphrases, we would expect a very low rank. However, usually a handful of images are quite similar (for instance, there are various images of airplanes flying, parking, taxiing or waiting on the runway) and sentence descriptions can vary greatly in detail and specificity for the same image. Table 1 (left) shows the results. We can see that averaging the high quality word vectors already captures a lot of similarity. The chain structure of a standard recurrent neural net performs worst since its representation is dominated by the last words in the sequence which may not be as important as earlier words.
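The rank recorded for each query can be computed as in the sketch below. The function name is hypothetical, and the scores in the usage example are made up for illustration; in the sentence self-similarity experiment the query sentence itself would be excluded from the candidates before ranking.

```python
import numpy as np

def rank_of_first_match(scores, is_correct):
    """1-based rank of the best-scoring correct candidate.

    scores     : (K,) inner products between the query and each candidate
    is_correct : (K,) boolean mask marking the correct candidates
    """
    order = np.argsort(-np.asarray(scores), kind="stable")   # highest score first
    hits = np.flatnonzero(np.asarray(is_correct)[order])
    return int(hits[0]) + 1

# Candidates 0 and 2 are correct; two incorrect candidates score higher.
rank = rank_of_first_match([0.2, 0.9, 0.5, 0.7], [True, False, True, False])
```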

Image Search with Query Sentences
This experiment evaluates how well we can find images that display the visual meaning of a given sentence. We first map a query sentence into the vector space and then find images in the same space using simple inner products. As shown in Table 1 (center), the new DT-RNN outperforms all other models.

Describing Images by Finding Suitable Sentences
Lastly, we repeat the above experiments but with roles reversed. For an image, we search for suitable textual descriptions again simply by finding close-by sentence vectors in the multi-modal embedding space. The average ranking of 25.3 for a correct sentence description is out of 500 possible sentences. A random assignment would give an average ranking of 100.
Table 2: Results of multimodal ranking when models are trained with a squared error loss and using Euclidean distance in the multimodal space. Better performance is reached for all models when trained with a max-margin loss and using inner products as in the previous table.

Analysis: Squared Error Loss vs. Margin Loss
We analyze the influence of the multimodal loss function on the performance. In addition, we compare using Euclidean distances instead of inner products. Table 2 shows that performance is worse for all models in this setting. Hodosh et al. (2013) and other related work use recall at n as an evaluation measure. Recall at n captures how often one of the top n closest vectors was a correct image or sentence and gives a good intuition of how a model would perform in a ranking task that presents n such results to a user. Below, we compare three commonly used and high performing models: bag of words, kCCA and our SDT-RNN on this different metric. Table 3 shows that the measures do correlate well and the SDT-RNN also performs best on the multimodal ranking tasks when evaluated with this measure.
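Both evaluation measures follow directly from the per-query ranks. The ranks in this sketch are made-up values for illustration only and are not results from the experiments; the function names are hypothetical.

```python
def recall_at_n(ranks, n):
    """Fraction of queries whose first correct result appears in the top n."""
    return sum(r <= n for r in ranks) / len(ranks)

def mean_rank(ranks):
    """Average 1-based rank of the first correct result."""
    return sum(ranks) / len(ranks)

ranks = [1, 3, 12, 7, 2]          # illustrative ranks, one per query
r5 = recall_at_n(ranks, 5)        # 0.6
r10 = recall_at_n(ranks, 10)      # 0.8
avg = mean_rank(ranks)            # 5.0
```

Recall at n is insensitive to how badly the remaining queries are ranked, whereas the mean rank can be dominated by a few very poor queries, which is why reporting both is informative.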

Error Analysis
In order to understand the main problems with the composed sentence vectors, we analyze the sentences that have the worst nearest neighbor rank between each other. We find that the main failure mode of the SDT-RNN occurs when a sentence that should describe the same image does not use a verb but the other sentences of that image do include a verb. For example, the following sentence pair has vectors that are very far apart from each other even though they are supposed to describe the same image: Generally, as long as both sentences either have a verb or do not, the SDT-RNN is more robust to different sentence lengths than bag of words representations.

Model Analysis: Semantic Composition Matrices
The best model uses composition matrices based on semantic relationships from the dependency parser. We give some insights into what the model learns by listing the composition matrices with the largest Frobenius norms. Intuitively, these matrices have learned larger weights that are being multiplied with the child vector in the tree and hence that child will have more weight in the final composed parent vector. In decreasing order of Frobenius norm, the relationship matrices are: nominal subject, possession modifier (e.g. their), passive auxiliary, preposition at, preposition in front of, passive auxiliary, passive nominal subject, object of preposition, preposition in and preposition on. The model learns that nouns are very important as well as their spatial prepositions and adjectives.
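Ranking relation matrices by Frobenius norm takes only a few lines. The matrices and relation labels below are hypothetical toy values, not the learned parameters.

```python
import numpy as np

def rank_by_frobenius_norm(matrices):
    """Sort relation names by the Frobenius norm of their matrix, largest first."""
    norms = {name: np.linalg.norm(W, ord="fro") for name, W in matrices.items()}
    return sorted(norms, key=norms.get, reverse=True)

# Toy stand-ins for learned relation-specific composition matrices.
matrices = {
    "nsubj": 3.0 * np.eye(2),
    "prep_on": 1.0 * np.eye(2),
    "det": 0.5 * np.eye(2),
}
order = rank_by_frobenius_norm(matrices)
```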

Conclusion
We introduced a new recursive neural network model that is based on dependency trees. For evaluation, we use the challenging task of mapping sentences and images into a common space for finding one from the other. Our new model outperforms baselines and other commonly used models that can compute continuous vector representations for sentences. In comparison to related models, the DT-RNN is more invariant and robust to surface changes such as word order.