Learning Representations Specialized in Spatial Knowledge: Leveraging Language and Vision

Spatial understanding is crucial in many real-world problems, yet little progress has been made towards building representations that capture spatial knowledge. Here, we move one step forward in this direction and learn such representations by leveraging a task consisting of predicting continuous 2D spatial arrangements of objects given object-relationship-object instances (e.g., "cat under chair") and a simple neural network model that learns the task from annotated images. We show that the model succeeds in this task and, furthermore, that it is capable of predicting correct spatial arrangements for unseen objects if either CNN features or word embeddings of the objects are provided. The differences between visual and linguistic features are discussed. Next, to evaluate the spatial representations learned in the previous task, we introduce a task and a dataset consisting of crowdsourced human ratings of spatial similarity for object pairs. We find that both CNN (convolutional neural network) features and word embeddings predict human judgments of similarity well and that these vectors can be further specialized in spatial knowledge if we update them when training the model that predicts spatial arrangements of objects. Overall, this paper paves the way towards building distributed spatial representations, contributing to the understanding of spatial expressions in language.


Introduction
Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006), or a number of robot navigation tasks. Despite recent advances in building specialized representations in domains such as sentiment analysis (Tang et al., 2014), semantic similarity/relatedness (Kiela et al., 2015) or dependency parsing (Bansal et al., 2014), little progress has been made towards building distributed representations (a.k.a. embeddings) specialized in spatial knowledge.
Intuitively, one may reasonably expect that the more attributes two objects share (e.g., size, functionality, etc.), the more likely they are to exhibit similar spatial arrangements with respect to other objects. Leveraging this intuition, we foresee that visual and linguistic representations can be spatially informative about unseen objects, as they encode features/attributes of objects (Collell and Moens, 2016). For instance, without having ever seen an "elephant" before, but only a "horse", one would probably envision the "elephant" carrying the "human" rather than the other way around, just by considering their size attribute. Similarly, one can infer that a "tablet" and a "book" will show similar spatial patterns (usually on a table, in someone's hands, etc.) although they barely show any visual resemblance; they are, however, similar in size and functionality. In this paper we systematically study how informative visual and linguistic features, in the form of convolutional neural network (CNN) features and word embeddings, are about the spatial behavior of objects.
An important goal of this work is to learn distributed representations specialized in spatial knowledge. As a vehicle to learn spatial representations, we leverage the task of predicting the 2D spatial arrangement of two objects under a relationship expressed by either a preposition (e.g., "below" or "on") or a verb (e.g., "riding", "jumping", etc.). For that, we make use of images where both objects are annotated with bounding boxes. For instance, in an image depicting (horse, jumping, fence) we reasonably expect to find the "horse" above the "fence". To learn the task, we employ a feed-forward network that represents objects as continuous (spatial) features in an embedding layer and guides the learning with a distance-based supervision on the objects' coordinates. We show that the model fares well in this task and that, by informing it with either word embeddings or CNN features, it is able to output accurate predictions about unseen objects, e.g., predicting the spatial arrangement of (man, riding, bike) without having ever been exposed to a "bike" before. This result suggests that the semantic and visual knowledge carried by the linguistic and visual features correlates to a certain extent with the spatial properties of words, thus providing predictive power for unseen objects.
To evaluate the quality of the spatial representations learned in the previous task, we introduce a task consisting of a set of 1,016 human ratings of spatial similarity between object pairs. It is thus desirable for spatial representations that "spatially similar" objects (i.e., objects that are arranged spatially similarly in most situations and relative to other objects) have similar embeddings. On these ratings we show, first, that both CNN features and word embeddings are good predictors of human judgments, and second, that these vectors can be further specialized in spatial knowledge if we update them by backpropagation when learning the model in the task of predicting spatial arrangements of objects.
The rest of the paper is organized as follows. In Sect. 2 we review related research. In Sect. 3 we describe two spatial tasks and a model. In Sect. 4 we describe our experimental setup. In Sect. 5 we present and discuss our results. Finally, in Sect. 6 we summarize our contributions.

Related Work
Contrary to earlier rule-based approaches to spatial understanding (Kruijff et al., 2007; Moratz and Tenbrink, 2006), Malinowski and Fritz (2014) propose a learning-based method that learns the parameters of "spatial templates" (or regions of acceptability of an object under a spatial relation) using a pooling approach. They show improved performance in image retrieval and image annotation (i.e., retrieving sentences given a query image) over previous rule-based systems and methods that rely on handcrafted templates. Contrary to us, they restrict themselves to relationships expressed by explicit spatial prepositions (e.g., "on" or "below"), while we also consider actions (e.g., "jumping"). Furthermore, they do not build spatial representations for objects.
Other approaches have shown the value of properly integrating spatial information into a variety of tasks. For example, Shiang et al. (2017) improve over the state of the art in object recognition by leveraging prior knowledge of object co-occurrences and relative positions of objects, which they mine from text and the web, in order to rank possible object detections. In a similar fashion, Lin and Parikh (2015) leverage common sense visual knowledge (e.g., object locations and co-occurrences) in two tasks: fill-in-the-blank and visual paraphrasing. They compute the likelihood of a scene to identify the most likely answer to multiple-choice textual scene descriptions. In contrast, we focus solely on spatial information rather than semantic plausibility. Moreover, our primary target is to build (spatial) representations. Alternatively, Elliott and Keller (2013) annotate geometric relationships between objects in images (e.g., they add an "on" link between "man" and "bike" in an image of a "man" "riding" a "bike") to better infer the action present in the image. For instance, if the "man" is next to the "bike" one can infer that the action "repairing" is more likely than "riding" in this image. Accounting for this extra spatial structure allows them to outperform bag-of-features methods in an image captioning task. In contrast with their restriction to a small domain of 10 actions (e.g., "taking a photo", "riding", etc.), our goal is to generalize to any unseen/rare objects and actions by learning from frequent spatial configurations and objects, and, critically, by leveraging representations of objects. Recent work (Collell et al., 2018) tackles the research question of whether relative spatial arrangements can be predicted as well from actions (e.g., "riding") as from spatial prepositions (e.g., "below"), and how to interpret the learned weights of the network. In contrast, our research questions concern spatial representations.
Crucially, none of the studies above have considered or attempted to learn distributed spatial representations of objects, nor studied how much spatial knowledge is contained in visual and linguistic representations.
The existence of quantitative, continuous spatial representations of objects has been formerly discussed, yet to our knowledge, not systematically investigated before. For instance, Forbus et al. (1991) conjectured that "there is no purely qualitative, general purpose representation of spatial properties", further emphasizing that the quantitative component is strictly necessary.
It is also worth commenting on early work aimed at enhancing the understanding of natural spatial language, such as the L0 project (Feldman et al., 1996). In the context of this project, Regier (1996) proposed a connectionist model that learns to predict a few spatial prepositions ("above", "below", "left", "right", "in", "out", "on", and "off") from low-resolution videos containing a limited set of toy objects (circle, square, etc.). In contrast, we consider an unlimited vocabulary of real-world objects, and we do not restrict ourselves to spatial prepositions but include actions as well. Hence, Regier's (1996) setting does not seem suited to deal with actions given that, in contrast to the spatial prepositions used there, which are mutually exclusive (an object cannot be "above" and simultaneously "below" another object), actions are not. In particular, actions exhibit large spatial overlap and, therefore, attempting to predict thousands of different actions from the relative locations of the objects seems infeasible. Additionally, Regier's (1996) architecture does not allow meaningful extraction of representations of objects from the visual input, which would in any case yield rather visual features.
Here, we propose an ad hoc setting for both learning and evaluating spatial representations. In particular, instead of learning to predict spatial relations from visual input as in Regier's (1996) work, we learn the reverse direction, i.e., we map the relation (and two objects) to their visual spatial arrangement. By backpropagating through the embeddings of the objects while learning the task, we enable learning spatial representations. As a core finding, we show in an ad hoc task, namely our collected human ratings of spatial similarity, that the learned features are more specialized in spatial knowledge than the CNN features and word embeddings that were used to initialize the parameters of the embeddings.

Tasks and Model
Here, we first describe the Prediction task and model that we use to learn the spatial representations. We subsequently present the spatial Similarity task which is employed to evaluate the quality of the learned representations.

Prediction Task
To evaluate the ability of a model or embeddings to learn spatial knowledge, we employ the task of predicting the spatial location of an Object ("O") relative to a Subject ("S") under a Relationship ("R").
Let O_c = (O_cx, O_cy) denote the coordinates of the center ("c") of the Object's bounding box, where O_cx ∈ R and O_cy ∈ R are its x and y components, and let O_b = (O_bx, O_by) denote the sizes (width and height) of the Object's bounding box ("b"). A similar notation applies to the Subject (i.e., S_c and S_b), and we denote model predictions with a hat (Ô_c, Ô_b). The task is to learn a mapping from the structured textual input (Subject, Relation, Object), abbreviated as (S, R, O), to the output consisting of the Object's center coordinates O_c and its size O_b (see Fig. 1).
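To make the notation concrete, one training instance can be sketched as a plain mapping from (S, R, O) plus the Subject's geometry to the Object's geometry. All field names and values below are illustrative, not taken from Visual Genome:

```python
# One illustrative instance of the Prediction task.
# Coordinates are normalized to [0, 1] (Sect. 3.2.1); in image coordinates
# the y-axis grows downward, so a larger y means "lower in the image".

instance = {
    "subject": "horse",      # S
    "relation": "jumping",   # R
    "object": "fence",       # O
    "S_c": (0.48, 0.35),     # Subject center (x, y)
    "S_b": (0.40, 0.30),     # Subject box sizes (width, height)
}

# The target the model must predict: the Object's center O_c and size O_b.
target = {
    "O_c": (0.55, 0.70),     # the fence sits below the jumping horse
    "O_b": (0.50, 0.15),
}

# Sanity check: all coordinates lie in [0, 1].
coords = (instance["S_c"], instance["S_b"], target["O_c"], target["O_b"])
assert all(0.0 <= v <= 1.0 for pair in coords for v in pair)
```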
We notice that a "Subject" is not necessarily a syntactic subject but simply a convenient notation to accommodate the case where the Relationship (R) is an action (e.g., "riding" or "wearing"), while when R is a spatial preposition (e.g., "below" or "on") the Subject simply denotes the referent object. Similarly, the Object is not necessarily a direct object. 1

Regression Model
Following the task above (Sect. 3.1), we consider a model (Fig. 1) whose first component is an embedding layer that maps each input word (S, R, and O) to a vector of continuous features via lookup tables whose numbers of rows are the vocabulary sizes.

[Figure 1: Overview of the model (right) and the image pre-processing setting (left).]

The embedding layer models our intuition that spatial properties of objects can be, to a certain extent, encoded with a vector of continuous features. In this work we test two types of embeddings, visual and linguistic. The next layer simply concatenates the three embeddings together with the Subject's size S_b and Subject center S_c. The inclusion of the Subject's size is aimed at providing a reference size to the model in order to predict the size of the Object. The concatenated vector is then fed into one or more hidden layers which act as a composition function for the triplet (S, R, O):

z = f(W_h x + b_h)

where x is the concatenated input vector, f(·) is the non-linearity, and W_h and b_h are the parameters of the layer. These "composition layers" allow the model to distinguish between, e.g., (man, walks, horse), which is spatially distinct from (man, rides, horse). We find that adding more layers generally improves performance, so the output z above can simply be composed with more layers, i.e., f(W_h2 z + b_h2). Finally, a linear output layer tries to match the ground truth targets y = (O_c, O_b) using a mean squared error (MSE) loss function:

L(ŷ, y) = ||ŷ − y||²

where ŷ = (Ô_c, Ô_b) is the model prediction and ||·|| denotes the Euclidean norm. Critically, unlike CNNs, the model does not make use of the pixels (which are discarded during the image pre-processing (Fig. 1 and Sect. 3.2.1)), but learns exclusively from image coordinates, yielding a simpler model focused solely on spatial information.
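A minimal NumPy sketch of the forward pass and loss just described. The toy dimensions, vocabulary sizes, and random initialization are our own illustrative assumptions (the paper uses d=300 or d=128 embeddings, 100-unit layers, and a Keras implementation, Sect. 4.5):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                        # embedding dimensionality (toy; 300 for GloVe, 128 for VGG-128)
V_s, V_r, V_o = 50, 20, 50   # toy vocabulary sizes for Subjects, Relations, Objects
h = 16                       # hidden units (toy; 100 ReLUs in the paper)

# Embedding lookup tables: one row per word.
E_s = rng.normal(size=(V_s, d))
E_r = rng.normal(size=(V_r, d))
E_o = rng.normal(size=(V_o, d))

# Composition and output layer parameters. The input to the first layer is the
# concatenation of three embeddings plus S_c (2 values) and S_b (2 values).
W1 = rng.normal(scale=0.1, size=(3 * d + 4, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, h));         b2 = np.zeros(h)
W_out = rng.normal(scale=0.1, size=(h, 4));      b_out = np.zeros(4)

def relu(x):
    return np.maximum(0.0, x)

def forward(s_id, r_id, o_id, S_c, S_b):
    """Map (S, R, O) plus the Subject's center and size to (O_c, O_b)."""
    x = np.concatenate([E_s[s_id], E_r[r_id], E_o[o_id], S_c, S_b])
    z = relu(W1.T @ x + b1)        # first composition layer
    z = relu(W2.T @ z + b2)        # second composition layer
    return W_out.T @ z + b_out     # linear output: (O_cx, O_cy, O_bx, O_by)

def mse_loss(pred, target):
    """Squared Euclidean norm ||y_hat - y||^2 for a single example."""
    return np.sum((pred - target) ** 2)

pred = forward(3, 5, 7, np.array([0.5, 0.4]), np.array([0.3, 0.2]))
assert pred.shape == (4,)
```

Training would then minimize the mean of `mse_loss` over all annotated instances with gradient descent (the paper uses RMSprop, Sect. 4.5).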

Image Pre-Processing
We perform the following pre-processing steps on the images before feeding them to the model. (i) Normalize the image coordinates by the number of pixels of each axis (vertical and horizontal). This step guarantees that coordinates are independent of the resolution of the image and always lie within [0, 1]. (ii) Mirror the image (when necessary). We notice that the distinction between right and left is arbitrary in images, since a mirrored image completely preserves its spatial meaning. For instance, a "man" "feeding" an "elephant" can be arbitrarily at either side of the "elephant", while a "man" "riding" an "elephant" cannot be arbitrarily below or above the "elephant". This left/right arbitrariness has also been acknowledged in prior work (Singhal et al., 2003). Thus, to enable more meaningful learning, we mirror the image when (and only when) the Object is at the left-hand side of the Subject. 4 The choice of leaving the Object always at the right-hand side is arbitrary and does not entail a loss of generality, i.e., we can consider left/right symmetrically reflected predictions as equiprobable. Mirroring thus provides a more realistic performance evaluation in the Prediction task and enables learning representations independent of the right/left distinction, which is irrelevant for the spatial semantics.
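The two pre-processing steps can be sketched as follows. The function name and the (center, size)-in-pixels box format are our own illustrative choices:

```python
def preprocess(img_w, img_h, S_c, S_b, O_c, O_b):
    """Normalize box coordinates by the image size and mirror horizontally
    so that the Object is never to the left of the Subject (a sketch of
    the paper's two pre-processing steps)."""
    # (i) Normalize: coordinates become resolution-independent, in [0, 1].
    S_c = (S_c[0] / img_w, S_c[1] / img_h)
    O_c = (O_c[0] / img_w, O_c[1] / img_h)
    S_b = (S_b[0] / img_w, S_b[1] / img_h)
    O_b = (O_b[0] / img_w, O_b[1] / img_h)
    # (ii) Mirror when (and only when) the Object lies left of the Subject;
    # x -> 1 - x flips the centers, box sizes are unaffected.
    if O_c[0] < S_c[0]:
        S_c = (1.0 - S_c[0], S_c[1])
        O_c = (1.0 - O_c[0], O_c[1])
    return S_c, S_b, O_c, O_b

# Illustrative 640x480 image where the Object's box starts left of the Subject's:
S_c, S_b, O_c, O_b = preprocess(640, 480,
                                (480, 120), (100, 160),   # Subject center / size
                                (200, 300), (300, 250))   # Object center / size
assert O_c[0] >= S_c[0]   # after mirroring, the Object sits to the right
```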

Spatial Similarity Task
To evaluate how well our embeddings match human mental representations of spatial knowledge about objects, we collect ratings for 1,016 word pairs (w1, w2), asking annotators to rate them by their spatial similarity. That is, objects that exhibit similar locations in most situations and are placed similarly relative to other objects receive a high score, and a lower score otherwise. For example, (cap, sunglasses) would receive a high score as both are usually at the top of the human body, while, following the same logic, (cap, shoes) would receive a lower score. Our collected ratings establish the spatial counterpart of other existing similarity ratings such as semantic similarity (Silberer and Lapata, 2014), visual similarity (Silberer and Lapata, 2014) or general relatedness (Agirre et al., 2009). A few exemplars of ratings are shown in Tab. 1. Following standard practices (Pennington et al., 2014), we compute the predicted similarity between two embeddings s_w1 and s_w2 (representing words w1 and w2) with their cosine similarity:

sim(w1, w2) = (s_w1 · s_w2) / (||s_w1|| ||s_w2||)

We notice that this spatial Similarity task does not involve learning; its main purpose is to evaluate the quality of the representations learned in the Prediction task (Sect. 3.1) and the spatial informativeness of visual and linguistic features.
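The cosine-similarity prediction above is straightforward to compute. The toy 3-dimensional "spatial embeddings" below are purely illustrative:

```python
import math

def cosine_similarity(u, v):
    """sim(w1, w2) = (s_w1 . s_w2) / (||s_w1|| ||s_w2||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors chosen so that (cap, sunglasses) score high
# and (cap, shoes) score low, mirroring the example in the text.
cap = [0.9, 0.8, 0.1]
sunglasses = [0.85, 0.75, 0.15]
shoes = [0.1, 0.2, 0.95]

assert cosine_similarity(cap, sunglasses) > cosine_similarity(cap, shoes)
```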

Experimental Setup
In this section we describe the experimental settings employed in the tasks and the model.

Visual Genome Data Set
We obtain our annotated data from Visual Genome (Krishna et al., 2017). This dataset contains 108,077 images and over 1.5M human-annotated object-relationship-object instances. Since we do not restrict to any particular domain (e.g., furniture or landscapes), the combinations (S, R, O) are markedly sparse, which makes learning our Prediction task especially challenging.

Evaluation Sets in the Prediction Task
In the Prediction task, we consider the following subsets of Visual Genome (Sect. 4.1) for evaluation purposes: (i) Original set: a test split from the original data which contains instances unseen at training time.
That is, the test combinations (S, R, O) might have been seen at training time, yet in different instances (e.g., in different images). This set contains a large number of noisy combinations such as (people, walk, funny) or (metal, white, chandelier).
(ii) Unseen Words set: We randomly select a list of 25 objects (e.g., "wheel", "camera", "elephant", etc.) among the 100 most frequent objects in Visual Genome. 5 We choose them among the most frequent ones in order to avoid meaningless objects such as "gate 2", "number 40" or "2:10 pm" which are not infrequent in Visual Genome. We then take all instances of combinations that contain any of these words, yielding ∼ 123K instances. For example, since "cap" is in our list, (girl, wears, cap) is included in this set. When we enforce "unseen" conditions, we remove all these instances from the training set, using them only for testing.

Visual and Linguistic Features
As our linguistic representations, we employ 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the Common Crawl corpus with 840B tokens and a 2.2M-word vocabulary. 6 We use the publicly available visual representations from Collell et al. (2017). 7 They extract 128-dimensional visual features with the forward pass of a VGG-128 (Visual Geometry Group) CNN model (Chatfield et al., 2014) pre-trained on ImageNet (Russakovsky et al., 2015). The representation of a word is the averaged feature vector (centroid) of all images in ImageNet for this concept. They only keep words with at least 50 images available. We notice that although we employ visual features from an external source (ImageNet), these could alternatively be obtained from the Visual Genome data, although ImageNet generally provides a larger number of images per concept.

Method Comparison
We consider two types of models, those that update the parameters of the embeddings (U ∼ "Update") and those that keep them fixed (NU ∼ "No Update") when learning the Prediction task. For each type (U and NU) we consider two conditions, embeddings initialized with pre-trained vectors (INI) and random embeddings (RND) drawn from a component-wise normal distribution with mean and standard deviation equal to those of the original embeddings. For example, U-RND corresponds to a model with updated, random embeddings. For the INI methods we also add a subscript indicating whether the embeddings are visual (vis) or linguistic (lang), as described in Sect. 4.3. 8 For the NU type we additionally consider one-hot embeddings (1H). We also include a control method (rand-pred) that outputs random uniform predictions.
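A sketch of how the RND embeddings can be drawn. We read "component-wise" as matching the per-dimension mean and standard deviation of the pre-trained matrix; the toy matrix and its statistics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a pre-trained embedding matrix (rows = words); illustrative only.
pretrained = rng.normal(loc=0.1, scale=0.5, size=(1000, 128))

# RND: draw each component from a normal whose mean and standard deviation
# match the component-wise statistics of the original embeddings.
mu = pretrained.mean(axis=0)       # per-dimension mean
sigma = pretrained.std(axis=0)     # per-dimension standard deviation
random_emb = rng.normal(loc=mu, scale=sigma, size=pretrained.shape)

assert random_emb.shape == pretrained.shape
```

This keeps the random baseline on the same scale as the pre-trained vectors, so any performance gap can be attributed to the content of the embeddings rather than their magnitude.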

Implementation Details and Validation
To validate results in our Prediction task we employ a 10-fold cross-validation (CV) scheme. That is, we split the data into 10 parts and employ 90% of the data for training and 10% for testing. This yields 10 embeddings (for each "U" method), which are then evaluated in our Similarity task. In both tasks, we report results averaged across the 10 folds.
Model hyperparameters are first selected by cross-validation on 10 initial splits and results are reported on 10 new splits. All models employ a learning rate of 0.0001 and are trained for 10 epochs by backpropagation with the RMSprop optimizer. The dimensionality of the embeddings is the original one, i.e., d=300 for GloVe and d=128 for VGG-128 (Sect. 4.3), which is preserved for the random-embedding methods RND (Sect. 4.4). Models employ 2 hidden layers with 100 Rectified Linear Units (ReLU), followed by an output layer with a linear activation. Early stopping is employed as a regularizer. We implement our models in Python 2.7 with the Keras deep learning framework (Chollet and others, 2015).
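The 10-fold splitting scheme can be sketched as follows (the helper name and the NumPy-based implementation are our own; the paper does not specify how splits are generated):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each fold serves once as the 10% test set, the rest as training."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# With k=10, each split uses 90% of the data for training and 10% for testing.
splits = list(kfold_indices(100, k=10))
assert len(splits) == 10
assert all(len(te) == 10 and len(tr) == 90 for tr, te in splits)
```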

Spatial Similarity Task
To build the word pairs, we randomly select a list of objects from Visual Genome, and from these we randomly choose 1,016 non-repeated word pairs (w1, w2). Ratings are collected with the Crowdflower 9 platform and correspond to averages of at least 5 reliable annotators 10 that provided ratings on a discrete scale from 1 to 10. The median similarity rating is 3.3 and the mean variance between annotators per word pair is ∼1.2.

Prediction Task
We evaluate model predictions with the following metrics.
(I) Regression metrics. R² is employed to evaluate the goodness of fit of a regression model and is related to the percentage of variance of the target explained by the predictions. The best possible score is 1 and it can be arbitrarily negative for bad predictions. A model that outputs random predictions would obtain scores close to 0, and one that outputs a constant prediction (the target mean) would obtain exactly 0.

9 https://www.crowdflower.com/
10 Reliable annotators are those with performance over 70% on the test questions (16 in our case) that the crowdsourcing platform allows us to introduce in order to test annotators' accuracy.
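The behavior of R² described above (1 for perfect fit, 0 for a constant prediction at the mean, negative for bad predictions) can be verified with a minimal implementation for a single target variable:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [0.2, 0.4, 0.6, 0.8]
assert r2_score(y, y) == 1.0                    # perfect predictions
assert abs(r2_score(y, [0.5] * 4)) < 1e-9       # constant prediction at the mean -> 0
assert r2_score(y, [0.9, 0.1, 0.9, 0.1]) < 0    # bad predictions can go negative
```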
(II) Classification. Additionally, given the semantic distinction between the vertical and horizontal axes noted above (Sect. 3.2.1), we consider the classification problem of predicting above/below relative locations. That is, if the predicted y-coordinate of the Object center Ô_cy falls below the y-coordinate of the Subject center S_cy and the actual Object center O_cy is below the Subject center S_cy, we count it as a correct prediction, and as incorrect otherwise. Likewise for above predictions. We compute both macro-averaged 11 accuracy (acc_y) and macro-averaged F1 (F1_y) metrics.
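Macro-averaging means averaging the per-class accuracies so that the "above" and "below" classes weigh equally regardless of their frequency. A minimal sketch (the function name and label strings are our own):

```python
def macro_accuracy(y_true, y_pred, classes=("above", "below")):
    """Average of per-class accuracies, so both classes weigh equally
    even if one relative location is far more frequent than the other."""
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

y_true = ["above", "above", "above", "below"]
y_pred = ["above", "above", "below", "below"]
# per-class accuracies: above 2/3, below 1/1 -> macro = (2/3 + 1) / 2 = 5/6
assert abs(macro_accuracy(y_true, y_pred) - 5 / 6) < 1e-9
```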
(III) Intersection over Union (IoU). We consider the bounding box overlap (IoU) from the VOC detection task (Everingham et al., 2015):

IoU = area(B_Ô ∩ B_O) / area(B_Ô ∪ B_O)

where B_Ô and B_O are the predicted and ground truth Object boxes, respectively. A prediction is counted as correct if the IoU is larger than 50%. Crucially, we notice that our setting and results are not comparable to object detection: we employ text instead of images as input and thus cannot leverage the pixels to locate the Object, unlike in detection.
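The IoU formula above can be implemented directly. The (x_min, y_min, x_max, y_max) box format is a common convention, chosen here for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes,
    each given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)
# intersection = 0.25 * 0.25 = 0.0625; union = 0.25 + 0.25 - 0.0625 = 0.4375
assert abs(iou(pred, gt) - 0.0625 / 0.4375) < 1e-9
assert iou(pred, gt) < 0.5   # would NOT count as a correct prediction
```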

Similarity Task
Following standard practices (Pennington et al., 2014), the performance of the predictions of (cosine) similarity from the embeddings (described in Sect. 3.3) is evaluated with the Spearman correlation ρ against the crowdsourced human ratings.
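Spearman's ρ is the Pearson correlation of the rank-transformed values, so it rewards any monotonic agreement between predicted similarities and human ratings. A minimal sketch (without the tie correction that a library implementation such as scipy's would apply):

```python
def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks
    (simple version; assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotonic relationship yields perfect rank correlation even if not linear.
sims = [0.1, 0.4, 0.2, 0.9]       # cosine similarities from embeddings
ratings = [1.0, 4.5, 2.0, 9.8]    # human spatial-similarity ratings
assert abs(spearman_rho(sims, ratings) - 1.0) < 1e-9
```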

Results and Discussion
We consider the notation of the methods from Sect. 4.4 and the evaluation subsets described in Sect. 4.2 for the Prediction task. To test statistical significance we employ a Friedman rank test and post hoc Nemenyi tests on the results of the 10 folds. Table 2 shows that the INI and RND 12 methods perform similarly in the Original test set, arguably because a large part of the learning takes place in the parameters of the layers subsequent to the embedding layer. However, in the next section we show that this is no longer the case when unseen words are present. We also observe that the one-hot embeddings NU-1H perform slightly better than the rest of the methods when no unseen words are present (Tab. 2 and Tab. 3, right). It is also worth noting that the results of the Prediction task are, in fact, conservative. First, the Original test data contains a considerable number of meaningless (e.g., (giraffe, a, animal)) and irrelevant combinations (e.g., (clock, has, numbers) or (sticker, identifies, apple)). Second, even when only meaningful examples are considered, we are inevitably penalizing plausible predictions. For instance, in (man, watching, man) we expect both men to be reasonably separated on the x-axis, yet the one with the highest y-coordinate is generally not predictable, as it depends on their height and their distance to the camera. This deflates above/below classification performance and correlations. Regardless, all methods (except rand-pred) exhibit reasonably high performance in all measures.

Evaluation on Unseen Words

Table 3 evidences that both visual and linguistic embeddings (INI_vis and INI_lang) significantly outperform their random-embedding counterparts (RND) by a large margin when unseen words are present. The improvement occurs for both updated (U) and non-updated (NU) embeddings, although it is expected that the updated methods perform slightly worse than the non-updated ones, since the original embeddings will have "moved" during training and therefore an unseen embedding (which has not been updated) might no longer be close to other semantically similar vectors in the updated space.

Besides statistical significance, it is worth mentioning that the INI methods consistently outperformed both their RND counterparts and NU-1H in each of the 10 folds (not shown here) by a steadily large margin. In fact, results are markedly stable across folds, in part due to the large size of the training and test sets (> 0.9M and > 120K examples, respectively). Additionally, to ensure that "unseen" results do not depend on our particular list of objects, we repeated the experiment with two additional lists of randomly selected objects, obtaining very similar results.
Remarkably, the INI methods experience only a small performance drop under unseen conditions (Tab. 3, left) compared to when we allow them to train with these words (Tab. 3, right), and this difference might be partially attributed to the reduction of the training data under "unseen" conditions, where at least 10% of the training data are left out.
Altogether, these results on unseen words show that semantic and visual similarities between concepts, as encoded by word and visual embeddings, can be leveraged by the model in order to predict spatial knowledge about unseen words. 13

Qualitative Insight
Visual inspection of model predictions is instructive to gain insight into the spatial informativeness of visual and linguistic representations on unseen words. Figure 2 shows heat maps of low (black) and high (white) probability regions for the objects. The "heat" for the Object is assumed to be normally distributed with mean (µ) equal to the predicted Object center Ô_c and standard deviation (σ) equal to the predicted Object size Ô_b (assuming independence of the x and y components, which yields the product of two Gaussians, one per component). The "heat" for the Subject is computed similarly, although with µ and σ equal to the actual Subject center S_c and size S_b, respectively. The INI methods in Figure 2 illustrate the contribution of the embeddings to the spatial understanding of unseen objects. In general, both visual and linguistic embeddings enabled predicting meaningful spatial arrangements, yet for the sake of space we have only included three examples: one where vis performs better than lang (third column), one where lang performs better than vis (second column), and one where both perform well (first column). We notice that the embeddings enable the model to infer that, e.g., since "camera" (unseen) is similar to "camcorder" (seen at training time), both must behave spatially similarly. Likewise, the embeddings enable correctly predicting the relative sizes of unseen objects. We also observe that when the embeddings are not informative enough, model predictions become less accurate. For instance, in NU-INI_lang, some unrelated objects (e.g., "ipod") have embeddings similar to "apple", and analogously for NU-INI_vis and "tail". We finally notice that predictions on unseen objects using random embeddings (RND) are markedly bad.

Similarity Task

Table 4 shows the results of evaluating the embeddings, including those learned in the Prediction task, against the human ratings of spatial similarity (Sect. 3.3).
Hence, only the "updated" methods (U) are shown, and we additionally include the concatenation of the visual and linguistic embeddings (CONC). Overall, the results show that both visual and linguistic features encode spatial properties of objects. In particular, linguistic features seem to be more spatially informative than visual features.

Spatial Similarity
Crucially, we observe a significant improvement of U-INI_vis over the original visual vectors (VGG-128) (p < 0.05) and of U-INI_lang over the original linguistic embeddings (GloVe) (p < 0.05), which evidences the effectiveness of training on the Prediction task as a method to further specialize embeddings in spatial knowledge. It is worth mentioning that these improvements are consistent in each of the 10 folds (not shown here) and markedly stable (see standard errors in Tab. 4).
We additionally observe that the concatenation of visual and linguistic embeddings, CONC_GloVe+VGG-128, outperforms all unimodal embeddings, suggesting that the fusion of visual and linguistic features provides a more complete description of the spatial properties of objects. Remarkably, the improvement is even larger for the concatenation of the embeddings updated during training, CONC_U-INI_lang+U-INI_vis, which obtains the highest performance overall. Figure 3 illustrates the progressive specialization of our embeddings in spatial knowledge as we train them on our Prediction task. We notice that all embeddings improve, yet U-INI_lang seems to worsen in quality when we over-train it, likely due to overfitting, as we do not use any regularizer besides early stopping. We also observe that although the random embeddings (RND) are the ones that benefit the most from training, their performance is still far from that of U-INI_vis and U-INI_lang, suggesting the importance of visual and linguistic features for representing spatial properties of objects. It is relevant to mention that in a pilot study we crowdsourced a different list of 1,016 object pairs with 3 instead of 5 annotators per pair. Results stayed remarkably consistent with those presented here; the improvement for the updated embeddings was in fact even larger.
Limitations of the Current Approach and Future Work

In order to keep the design clean in this first paper on distributed spatial representations, we employ a fully supervised setup. However, we notice that methods to automatically parse images (e.g., object detectors) and sentences are available.
A second limitation is the 2D simplification of the actual 3D world that our approach and the current spatial literature generally employs. Even though methods that infer 3D structure from 2D images exist, this is beyond the scope of this paper which shows that a 2D treatment already enhances the learned spatial representations. It is also worth noting that the proposed regression setting trivially generalizes to 3D if suitable data are available, and in fact, we believe that the learned representations could further benefit from such extension.

Conclusions
Altogether, this paper sheds light on the problem of learning distributed spatial representations of objects. To learn spatial representations we have leveraged the task of predicting the continuous 2D relative spatial arrangement of two objects under a relationship, and a simple embedding-based neural model that learns this task from annotated images. In the same Prediction task we have shown that both word embeddings and CNN features endow the model with great predictive power when it is presented with unseen objects. Next, in order to assess the spatial content of distributed representations, we have collected a set of 1,016 object pairs rated by spatial similarity. We have shown that both word embeddings and CNN features are good predictors of human spatial judgments. More specifically, we find that word embeddings (ρ = 0.535) tend to perform better than visual features (ρ ∼ 0.46), and that their combination (ρ ∼ 0.6) outperforms both modalities separately. Crucially, on the same ratings we have shown that by training the embeddings in the Prediction task we can further specialize them in spatial knowledge, making them more akin to human spatial judgments. To benchmark the task, we make the Similarity dataset and our trained spatial representations publicly available. 14 Lastly, this paper contributes to the automatic understanding of spatial expressions in language. The lack of common sense knowledge has been recurrently argued as one of the main reasons why machines fail at exhibiting more "human-like" behavior in tasks (Lin and Parikh, 2015). Here, we have provided a means of compressing and encoding such common sense spatial knowledge about objects into distributed representations, further showing that these specialized representations correlate well with human judgments.

14 https://github.com/gcollell/spatial-representations
In future work, we will also explore the application of our trained spatial embeddings in extrinsic tasks in which representing spatial knowledge is essential such as robot navigation or robot understanding of natural language commands (Guadarrama et al., 2013;Moratz and Tenbrink, 2006). Robot navigation tasks such as assisting people with special needs (blind, elderly, etc.) are in fact becoming increasingly necessary (Ye et al., 2015) and require great understanding of spatial language and spatial connotations of objects.