Representation Learning for Grounded Spatial Reasoning

The interpretation of spatial references is highly contextual, requiring joint inference over both language and the environment. We consider the task of spatial reasoning in a simulated environment, where an agent can act and receive rewards. The proposed model learns a representation of the world steered by instruction text. This design allows for precise alignment of local neighborhoods with corresponding verbalizations, while also handling global references in the instructions. We train our model with reinforcement learning using a variant of generalized value iteration. The model outperforms state-of-the-art approaches on several metrics, yielding a 45% reduction in goal localization error.


Introduction
Understanding spatial references in natural language is essential for successful human-robot communication and autonomous navigation. This problem is challenging because interpretation of spatial references is highly context-dependent. For instance, the instruction "Reach the cell above the westernmost rock" translates into different goal locations in the two environments shown in Figure 1. Therefore, to enable generalization to new, unseen worlds, the model must jointly reason over the instruction text and environment configuration. Moreover, the richness and flexibility in verbalizing spatial references further complicate interpretation of such instructions.
Figure 1: Sample 2D worlds and an instruction describing a goal location ("Reach the cell above the westernmost rock"). The optimal path from a common start position, denoted by a white dashed line, varies considerably with changes in the map layout.
In this paper, we explore the problem of spatial reasoning in the context of interactive worlds. Specifically, we assume access to a simulated environment, in which an agent can take actions to interact with the world and is rewarded for reaching the location specified by the language instruction. This feedback is the only source of supervision the model uses for interpreting spatial references.
The key modeling task here is to induce a representation that closely ties environment observations and linguistic expressions. In prior work, this issue was addressed by learning representations for each modality and then combining them, for instance, with concatenation (Misra et al., 2017). While this approach captures high-level correspondences between instructions and maps, it does not encode detailed, lower-level mappings between specific positions on the map and their descriptions. As our experiments demonstrate, combining the language and environment representations in a spatially localized manner yields significant performance gains on the task.
To this end, our model uses the instruction text to drive the learning of the environment representation. We start by converting the instruction text into a real-valued vector using a recurrent neural network with LSTM cells (Hochreiter and Schmidhuber, 1997). Using this vector as a kernel in a convolution operation, we obtain an instruction-conditioned representation of the state. This allows the model to reason about immediate local neighborhoods in references such as "two cells to the left of the triangle". We further augment this design to handle global references that involve information concerning the entire map (e.g., "the westernmost rock"). This is achieved by predicting a global value map using an additional component of the instruction representation. The entire model is trained with reinforcement learning using the environmental reward signal as feedback.
We conducted our experiments using a 2D virtual world as shown in Figure 1. Overall, we created over 3,300 tasks across 200 maps, with instructions sourced from Mechanical Turk. We compare our model against two state-of-the-art systems adapted for our task (Misra et al., 2017; Schaul et al., 2015).
The key findings of our experiments are threefold. First, our model can more precisely interpret instructions than baseline models and find the goal location, yielding a 45% reduction in Manhattan distance error over the closest competitor. Second, the model can robustly generalize across new, unseen map layouts. Finally, we demonstrate that factorizing the instruction representation enables the model to sustain high performance when handling both local and global references.

Related Work
Spatial reasoning in text This topic has attracted both theoretical and practical interest. From the linguistic and cognitive perspectives, research has focused on the wide range of mechanisms that speakers use to express spatial relations (Tenbrink, 2007; Viethen and Dale, 2008; Byrne and Johnson-Laird, 1989; Li and Gleitman, 2002). The practical implications of this research are related to autonomous navigation (Moratz and Tenbrink, 2006; Levit and Roy, 2007; Tellex et al., 2011) and human-robot interaction (Skubic et al., 2004).
Previous computational approaches include techniques such as proximity fields (Kelleher et al., 2006), spatial templates (Levit and Roy, 2007) and geometrically defined mappings (Moratz and Tenbrink, 2006; Kollar et al., 2010). More recent work in robotics has integrated text containing position information with spatial models of the environment to obtain accurate maps for navigation (Walter et al., 2013; Hemachandra et al., 2014). Most of these approaches assume access to detailed geometry or other forms of domain knowledge. In contrast to these knowledge-rich approaches, we learn spatial reference via interaction with the environment, acquiring knowledge of the environment in the process.
Instruction following Spatial reasoning is a common element in many papers on instruction following (MacMahon et al., 2006; Vogel and Jurafsky, 2010; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Kim and Mooney, 2013; Andreas and Klein, 2015). As a source of supervision, these methods assume access to demonstrations, which specify the path corresponding to the provided instructions. In our setup, the agent is driven only by the final rewards when the goal is achieved. This weaker source of supervision motivates development of new techniques not considered in prior work.
More recently, Misra et al. (2017) proposed a neural architecture for jointly mapping instructions and visual observations (pixels) to actions in the environment. Their model separately induces text and environment representations, which are concatenated into a single vector that is used to output an action policy. While this representation captures coarse correspondences between the modalities, it doesn't encode mappings at the level of local neighborhoods, negatively impacting performance on our task.
Universal value functions The idea of generalized value functions has been explored before in Schaul et al. (2015). The technique, termed UVFA, presents a clever trick of factorizing the value function over states and goals using singular value decomposition (SVD) and then learning a regression model to predict the low-rank vectors. This results in quick and effective generalization to all goals in the same state space. However, their work stops short of exploring generalization over map layouts, which our model is designed to handle. Furthermore, our setup also involves specifying goals using natural language instructions, which is different from the coordinate-style specification used in that work.

General Framework
Task setup We model our task as a Markov Decision Process (MDP), where an autonomous agent is placed in an interactive environment with the capability to choose actions that can affect the world. A goal is described in text, and rewards are available to the agent correspondingly. The MDP can be represented by the tuple ⟨S, A, X, T, R⟩, where S is the set of all possible state configurations, A is the set of actions available to the agent, X is the set of all goal specifications² in natural language, T(s' | s, a, x) is the transition distribution, and R(s, x) is the reward function. A state s ∈ S includes information such as the locations of different entities along with the agent's own position. In this work, T is deterministic in the environments considered; however, our methods also apply in the stochastic case.
² We will use the terms goal specifications and instructions interchangeably.
Text instructions Prior work has investigated human usage of different types of referring expressions to describe spatial relations (Levinson, 2003; Viethen and Dale, 2008). In order to build a robust instruction following system, we examine several categories of spatial expressions that exhibit the wide range of natural language goal descriptions. Specifically, we consider instructions that utilize objects/entities present in the environment to describe a goal location. These instructions can be categorized into three groups:
(a) Text referring to a specific entity (e.g., "Go to the circle").
(b) Text specifying a location using a single referent entity (e.g., "Reach the cell above the westernmost rock").
(c) Text specifying a location using multiple referent entities (e.g., "Move to the goal two squares to the left of the heart and top right of house").
These three categories exemplify an increasing level of complexity, with the last one having multiple levels of indirection.
In each category, we have both local and global references to objects. Local references require an understanding of spatial prepositional phrases such as 'above', 'in between' and 'next to' in order to determine the precise goal location. This comprehension is invariant to the global position of the object landmark(s) provided in the instruction. A global reference, on the other hand, contains superlatives such as 'easternmost' and 'topmost', which require reasoning over the entire map. For example, in the case of (a) above, a local reference would describe a unique object (e.g., "Go to the circle"), whereas a global reference might require comparing the positions of all objects of a specific type (e.g., "Go to the northernmost tree").
A point to note is that we do not assume access to a mapping from instructions to objects or entities in the worlds, nor knowledge of a spatial ontology; the system has to learn this entirely through feedback from the environment.
Generalized Value Iteration Learning to reach the goal while maximizing cumulative reward can be done by using a value function V(s) (Sutton and Barto, 1998), which represents the agent's notion of expected future reward from state s. A popular algorithm to learn an optimal value function is Value Iteration (VI) (Bellman, 1957), which uses the technique of dynamic programming.
Figure 2: A schematic depiction of our model. Text instructions are represented as a vector h(t) and states as embeddings φ(s). A portion of the text representation is used as a convolutional kernel on φ(s), giving a text-conditioned local state representation z_1. The remaining components are used as coefficients in a linear combination of gradient functions to give a global map-level representation z_2. z_1 and z_2 are concatenated and input to a convolutional neural network to predict the final value map.
In the standard Bellman equation, the value function is dependent solely on state. Schaul et al. (2015) proposed a value function V(s, g) describing the expected reward from being in state s given goal g, capturing that state values are goal-dependent and that a single environment can offer many such goals. We also make use of such a generalized value function, although our goals are not observed directly as coordinate locations or states themselves but rather described in natural language. With x denoting a textual description of a goal, our VI update equations are:
$$ Q(s, a, x) = R(s, x) + \gamma \sum_{s' \in S} T(s' \mid s, a, x)\, V(s', x), \qquad V(s, x) = \max_a Q(s, a, x) \tag{1} $$
where Q is the action-value function, tracking the value of choosing action a in state s, and γ is the discount factor. Once an optimal value function is learned, a straightforward action policy is:
$$ \pi(s, x) = \arg\max_a Q(s, a, x) \tag{2} $$
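The value iteration update and the greedy policy can be illustrated on a toy grid. The following is a minimal sketch, not the authors' implementation: it assumes a deterministic transition function with four moves, and it assumes the instruction x has already been resolved into a reward map and a goal cell. The names `step`, `q_value`, `value_iteration`, and `greedy_action` are hypothetical.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]

# Four deterministic moves on a grid (an assumption for this sketch).
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def step(state: State, action: str, rows: int, cols: int) -> State:
    """Deterministic T: move one cell, or stay put at the map border."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < rows and 0 <= c < cols else state


def q_value(V: List[List[float]], reward: List[List[float]],
            state: State, action: str, gamma: float) -> float:
    """Q(s, a, x) = R(s, x) + gamma * V(s', x) when T is deterministic."""
    nr, nc = step(state, action, len(V), len(V[0]))
    return reward[state[0]][state[1]] + gamma * V[nr][nc]


def value_iteration(reward: List[List[float]], goal: State,
                    gamma: float = 0.95, sweeps: int = 100) -> List[List[float]]:
    """Repeatedly apply V(s, x) = max_a Q(s, a, x); the goal is terminal."""
    rows, cols = len(reward), len(reward[0])
    V = [[0.0] * cols for _ in range(rows)]
    for _ in range(sweeps):
        new_V = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal:
                    new_V[r][c] = reward[r][c]  # terminal: no future reward
                else:
                    new_V[r][c] = max(
                        q_value(V, reward, (r, c), a, gamma) for a in ACTIONS
                    )
        V = new_V
    return V


def greedy_action(V: List[List[float]], reward: List[List[float]],
                  state: State, gamma: float = 0.95) -> str:
    """pi(s, x) = argmax_a Q(s, a, x)."""
    return max(ACTIONS, key=lambda a: q_value(V, reward, state, a, gamma))


# Toy 3x3 map: goal reward in the top-right corner, one puddle in the centre.
reward_map = [[0.0, 0.0, 3.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]]
goal_cell = (0, 2)
V_star = value_iteration(reward_map, goal_cell)
```

In this toy map the greedy policy routes around the puddle cell, because its negative reward lowers the value of any path that passes through it.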

Model
Generalization over both environment configurations and text instructions requires a model that meets two desiderata. First, it must have a flexible representation of goals, one which can encode both the local structure and global spatial attributes inherent to natural language instructions. Second, it must be compositional; the representation of language should be generalizable even though each unique instruction will only be observed with a single map during training. Namely, the learned representation for a given instruction should still be useful even if the objects on a map are rearranged or the layout is changed entirely.
To that end, our model combines the textual instructions with the map in a spatially localized manner, as opposed to prior work which joins goal representations and environment observations via simpler functions like an inner product (Schaul et al., 2015). While our approach can more effectively learn local relations specified by language, it cannot naturally capture descriptions at the global environment level. To address this problem, we also use the language representation to predict coefficients for a basis set of gradient functions which can be combined to encode global spatial relations.
Algorithm 1 (only steps 9 to 13 are recoverable in this copy):
9: Observe next state s_t and reward r_t
10: Store trajectory (s = s_0, s_1, ..., r = r_0, r_1, ...) in D
11: for j = 1, J do
12: Sample random trajectory (s, r) from D
13: Perform gradient descent step on loss L(θ)
More formally, inputs to our model (see Figure 2) consist of an environment observation s and textual description of a goal x. For simplicity, we will assume s to be a 2D matrix, although the model can easily be extended to other input representations. We first convert s to a 3D tensor by projecting each cell to a low-dimensional embedding (φ) as a function of the objects contained in that cell. In parallel, the text instruction x is passed through an LSTM recurrent neural network (Hochreiter and Schmidhuber, 1997) to obtain a continuous vector representation h(x). This vector is then split into local and global components, h(x) = [h_1(x); h_2(x)]. The local component, h_2(x), is reshaped into a kernel to perform a convolution operation on the state embedding φ(s) (similar to Chen et al. (2015)):
$$ z_1 = \psi_1(\phi(s); h_2(x)) \tag{3} $$
where ψ_1 denotes the convolution with the kernel formed from h_2(x). Meanwhile, the three-element global component h_1(x) is used to form the coefficients for a vertical and horizontal gradient along with a corresponding bias term.⁴ The gradients, denoted G_1 and G_2 in Figure 2, are matrices of the same dimensionality as the state observation with values increasing down the rows and along the columns, respectively. The axis-aligned gradients are weighted by the elements of h_1(x) and summed to give a final global gradient spanning the entire 2D space, analogous to how steerable filters can be constructed for any orientation using a small set of basis filters (Freeman and Adelson, 1991):
$$ z_2 = h_1(x)[1] \cdot G_1 + h_1(x)[2] \cdot G_2 + h_1(x)[3] \cdot J \tag{4} $$
in which J is the all-ones matrix also of the same dimensionality as the observed map.
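To make the global gradient map concrete, the sketch below builds the two axis-aligned basis gradients and the all-ones matrix and combines them with three coefficients. It is an illustration only: the scaling of G_1 and G_2 (raw row and column indices here) is an assumption, the coefficients are hand-picked rather than predicted by an LSTM, and the function names `basis_maps` and `global_map` are hypothetical. Indices are 0-based in the code, whereas the text indexes h_1(x) from 1.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def basis_maps(rows: int, cols: int) -> Tuple[Matrix, Matrix, Matrix]:
    """Return (G1, G2, J): row gradient, column gradient, all-ones matrix."""
    g1 = [[float(r) for _ in range(cols)] for r in range(rows)]  # grows down the rows
    g2 = [[float(c) for c in range(cols)] for _ in range(rows)]  # grows along the columns
    ones = [[1.0] * cols for _ in range(rows)]
    return g1, g2, ones


def global_map(h1: Sequence[float], rows: int, cols: int) -> Matrix:
    """Weighted sum h1[0] * G1 + h1[1] * G2 + h1[2] * J over the whole map."""
    g1, g2, ones = basis_maps(rows, cols)
    return [
        [h1[0] * g1[r][c] + h1[1] * g2[r][c] + h1[2] * ones[r][c]
         for c in range(cols)]
        for r in range(rows)
    ]


# A negative weight on the column gradient favours the westernmost cells.
westward = global_map([0.0, -1.0, 0.0], rows=3, cols=4)
```

With these hand-picked coefficients every row of `westward` peaks in column 0, which is the kind of map-wide preference a superlative such as "westernmost" calls for.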
Finally, the local and global information maps are concatenated into a single tensor, which is then processed by a convolutional neural network (CNN) with parameters θ, written here as ψ_2, to approximate the generalized value function:
$$ \hat{V}(s, x) = \psi_2([z_1; z_2]; \theta) \tag{5} $$
for every state s in the map.
Reinforcement Learning Given our model's V̂(s, x) predictions, the resulting policy (Equation 2) can be enacted, giving a continuous trajectory of states {s_t, s_{t+1}, ...} on a single map and their associated rewards {r_t, r_{t+1}, ...} at each timestep t. We store entire trajectories (as opposed to state transition pairs) in a replay memory D as described in Mnih et al. (2015). The model is trained to produce an accurate value estimate by minimizing the following objective:
$$ \mathcal{L}(\Theta) = \mathbb{E}_{s \sim D}\left[ \hat{V}(s, x; \Theta) - \left( R(s, x) + \gamma \max_a \sum_{s'} T(s' \mid s, a, x)\, \hat{V}(s', x; \Theta^-) \right) \right]^2 \tag{6} $$
where s is a state sampled from D, γ is the discount factor, Θ is the set of parameters of the entire model, and Θ⁻ is the set of parameters of a target network copied periodically from our model. The complete training procedure is shown in Algorithm 1.
⁴ Note that we are referring to gradient filters here, not the gradient calculated during backpropagation in deep learning.
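The training objective compares the model's value prediction at a sampled state with a one-step lookahead target computed from a periodically copied target network. The sketch below evaluates such a squared error on a toy grid. It assumes a deterministic four-move transition function, represents both networks simply as precomputed value maps, and omits terminal-state handling; `successors` and `td_loss` are hypothetical names.

```python
from typing import List, Sequence, Tuple

State = Tuple[int, int]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def successors(state: State, rows: int, cols: int) -> List[State]:
    """Next states for each move; moves off the map leave the agent in place."""
    out = []
    for dr, dc in MOVES:
        r, c = state[0] + dr, state[1] + dc
        out.append((r, c) if 0 <= r < rows and 0 <= c < cols else state)
    return out


def td_loss(states: Sequence[State], reward: List[List[float]],
            v_pred: List[List[float]], v_target: List[List[float]],
            gamma: float = 0.95) -> float:
    """Mean squared gap between predictions and lookahead targets.

    v_pred stands in for the current model's value map and v_target for the
    value map produced by the frozen target network.
    """
    rows, cols = len(reward), len(reward[0])
    total = 0.0
    for r, c in states:
        best_next = max(v_target[nr][nc] for nr, nc in successors((r, c), rows, cols))
        target = reward[r][c] + gamma * best_next
        total += (v_pred[r][c] - target) ** 2
    return total / len(states)


# One sampled state on a 1x2 map: target = 0 + 0.95 * 2.0 = 1.9.
example_loss = td_loss([(0, 0)], [[0.0, 3.0]], [[2.0, 3.0]], [[1.0, 2.0]])
```

In an actual training loop the gradient of this quantity would be taken with respect to the model parameters only, with the target map held fixed between periodic copies.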

Experimental Setup
Puddle world navigation data In order to study generalization across a wide variety of environmental conditions and linguistic inputs, we develop an extension of the puddle world reinforcement learning benchmark (Sutton, 1996; Mankowitz et al., 2016). States in a 10 × 10 grid are first filled with either grass or water cells, such that the grass forms one connected component. We then populate the grass region with six unique objects which appear only once per map (triangle, star, diamond, circle, heart, and spade) and four non-unique objects (rock, tree, horse, and house) which can appear any number of times on a given map. See Figure 1 for an example visualization.
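The requirement that grass cells form one connected component can be checked with a breadth-first search. The sketch below is one possible check, not the generator used for the dataset: it assumes 4-connectivity and a hypothetical text encoding of maps in which 'g' marks grass and 'w' marks water.

```python
from collections import deque
from typing import List


def grass_is_connected(grid: List[str]) -> bool:
    """Return True if all 'g' cells are reachable from one another (4-neighbours)."""
    cells = {
        (r, c)
        for r, row in enumerate(grid)
        for c, ch in enumerate(row)
        if ch == "g"
    }
    if not cells:
        return False
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nxt in cells and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == cells


connected_map = ["ggw", "wgw", "wgg"]
split_map = ["gwg", "www", "gwg"]
```

A map generator could repeatedly sample grass and water layouts and keep only those for which this check succeeds.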

Split   Local   Global
Train   1566    1071
Test    399     272
Table 1: Statistics on the data (number of local and global instructions in each split).
Goal positions are chosen uniformly at random from the set of grass cells, encouraging the use of spatial references to describe goal locations which do not themselves contain a unique object. We used the Mechanical Turk crowdsourcing platform (Buhrmester et al., 2011) to collect natural language descriptions of these goals. Human annotators were asked to describe the positions of these goals using surrounding objects. At the end of each trial, we asked the same participants to provide goal locations given their own text instructions. This helped filter out a majority of instructions that were ambiguous or ill-specified. Table 1 provides some statistics on the data, and Figure 3 shows example annotations. In total, we collected 3308 instructions, ranging from 2 to 43 words in length, describing over 200 maps. There are 361 unique words in the annotated instructions. We do not perform any preprocessing on the raw annotations.
It is plausible that a model designed to handle only local references could not handle global ones (consider our own model without the global gradient maps). For clearer interpretation of results, we evaluate our model in two modes: trained and tested on local and global data separately, or as a combined dataset. While local instructions were obtained easily, the global instructions were collected by designing a task in which only non-unique objects were presented to the annotators. This precluded simple instructions like "go left of the object" because there would always be more than one of each object type. Therefore, we obtained text with global properties (e.g., middle rock, leftmost tree) to sufficiently pinpoint an object. On average, we collected 31 unique local instructions and 10 unique global instructions per map.
To quantify the diversity of our dataset, we find the five nearest instructions in the training set for every instruction in the test set, as measured by edit distance (using the word as a unit) normalized by test instruction length. For each of these pairs, we also measure the Manhattan distance between their corresponding goal locations. Figure 4, which visualizes this analysis, underscores the difficulty of this task; even when two instructions are highly similar, they might correspond to entirely different target locations. This is the case in the example in Figure 1, which has a distance of four between the referenced goals.
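The word-level edit distance used in this analysis can be sketched as follows. This is an illustrative implementation of standard Levenshtein distance over word tokens; the whitespace tokenization and the function names are assumptions, not details taken from the experiments.

```python
from typing import List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where the unit of editing is a whole word."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(test_instruction: str, train_instruction: str) -> float:
    """Edit distance divided by the number of words in the test instruction."""
    test_words = test_instruction.split()
    train_words = train_instruction.split()
    return word_edit_distance(test_words, train_words) / len(test_words)


def nearest_instructions(test_instruction: str,
                         train_instructions: List[str], k: int = 5) -> List[str]:
    """The k training instructions closest to the test instruction."""
    return sorted(
        train_instructions,
        key=lambda t: normalized_distance(test_instruction, t),
    )[:k]
```

For example, "go to the circle" and "go to the star" differ by one substituted word out of four, giving a normalized distance of 0.25 even though the two goals may lie far apart on the map.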
Baselines We compare our model to the following baselines: UVFA (text) is a variant of the model described by Schaul et al. (2015), adapted for our task. The original model made use of two MLPs to learn low-dimensional embeddings of states and goals, which were then combined via dot product to give value estimates. Goals were represented either as (x, y) coordinates or as states themselves. As our goals are not observed directly but described in text, we replace the goal MLP with the same LSTM as in our model. The state MLP has an identical architecture to that of the UVFA: two hidden layers of dimension 128 and ReLU activations. For consistency with the UVFA, we represent states as binary vectors denoting the presence of each type of object at every position.
CNN + LSTM is a variant of the model described in Misra et al. (2017), who developed it for a language-grounded block manipulation task. It first convolves the map layout to a low-dimensional representation (as opposed to the MLP of the UVFA) and concatenates this to the LSTM's instruction embedding (as opposed to a dot product). These concatenated representations are then input to a two-layer MLP.
We also perform analysis to study the representational power of our model, introducing two more comparison models: UVFA (pos) is the original UVFA model from Schaul et al. (2015), which we evaluate on our modified puddle worlds to determine the difficulty of environment generalization independently from instruction interpretation.
Our model (w/o gradient) is an ablation of our model without the global gradient maps, which allows us to determine the gradients' role in representation-building.
In addition to our reinforcement learning experiments, we train these models in a supervised setting to isolate the effects of architecture choices from other concerns inherent to reinforcement learning algorithms. For this purpose, we constructed a dataset of ground-truth value maps for all human-annotated goals using value iteration. We use the models to predict value maps for the entire grid and minimize the mean squared error (MSE) compared to the ground truth values.
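One way to write this supervised objective is given below. The notation is an assumption made for illustration: V^* denotes the ground-truth value map obtained from value iteration, and the average is taken over all states of a map; the exact normalization used in the experiments is not specified here.

```latex
\mathcal{L}_{\mathrm{sup}}(\Theta)
  = \frac{1}{|S|} \sum_{s \in S}
    \left( \hat{V}(s, x; \Theta) - V^{*}(s, x) \right)^{2}
```

Unlike the reinforcement learning objective, the regression target here does not depend on a target network, since the ground-truth values are fixed.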

Implementation details
Our model implementation uses an LSTM with a learnable 15-dimensional embedding layer, 30 hidden units, 8-dimensional embeddings φ(s), and a 3×3 kernel applied to the embeddings, giving a dimension of 72 for h_2(t). The final CNN has layers of {3, 6, 12, 6, 3, 1} channels, all with 3×3 kernels and padding of length 1 such that the output value map prediction is equal in size to the input observation. For each map, a reward of 3 is given for reaching the correct goal specified by human annotation and a reward of −1 is given for falling in a puddle cell. The only terminal state is when the agent is at the goal. Rewards are discounted by a factor of 0.95. We use Adam optimization (Kingma and Ba, 2015) for training all models.
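The dimension of 72 follows from 3 × 3 × 8: one weight per kernel position per embedding channel. The sketch below reshapes such a 72-element vector into a 3×3×8 kernel and applies it with padding 1, so that the output map has the same size as the input. The memory layout chosen for the reshape, the cross-correlation convention (as is common in deep learning libraries), and the names `reshape_kernel` and `conv_same` are all assumptions for illustration.

```python
from typing import List, Sequence

Kernel = List[List[List[float]]]   # [k][k][channels]
Embedding = List[List[List[float]]]  # [rows][cols][channels]


def reshape_kernel(h2: Sequence[float], k: int = 3, channels: int = 8) -> Kernel:
    """Reshape a flat vector of length k*k*channels into a k x k x channels kernel."""
    assert len(h2) == k * k * channels
    return [
        [[h2[(i * k + j) * channels + ch] for ch in range(channels)]
         for j in range(k)]
        for i in range(k)
    ]


def conv_same(phi: Embedding, kernel: Kernel) -> List[List[float]]:
    """Zero-padded cross-correlation giving one output value per map cell."""
    rows, cols, channels = len(phi), len(phi[0]), len(phi[0][0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        for ch in range(channels):
                            acc += phi[rr][cc][ch] * kernel[i][j][ch]
            out[r][c] = acc
    return out


# Toy 4x4 map with an object encoded in channel 0 at row 1, column 2.
phi = [[[0.0] * 8 for _ in range(4)] for _ in range(4)]
phi[1][2][0] = 1.0

# Hand-set kernel that responds to "an object in channel 0 directly to my right".
h2 = [0.0] * 72
h2[(1 * 3 + 2) * 8 + 0] = 1.0
z1 = conv_same(phi, reshape_kernel(h2))
```

With this hand-set kernel the only nonzero entry of `z1` is at row 1, column 1, the cell immediately to the left of the object, which mirrors how a phrase such as "left of the triangle" could select a local neighborhood.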

Results
We present empirical results on two different datasets: our annotated puddle world and an existing block navigation task (Bisk et al., 2016).

Puddle world navigation
Table 3: Performance on a test set of environments and instructions after supervised training. Lower is better for MSE and Manhattan distance; higher is better for policy quality. The gradient basis significantly improves the reconstruction error and goal localization of our model on global instructions, and expectedly does not affect its performance on local instructions.
Comparison with the state-of-the-art We first investigate the ability of our model to learn solely from environment simulation. Figure 5 shows the discounted reward achieved by our model as well as the two baselines for both instruction types. In both experiments, our model is the only one of the three to achieve an average nonnegative reward after convergence (0.88 for local instructions and 0.49 for global instructions), signifying that the baselines do not fully learn how to navigate through these environments. Following Schaul et al. (2015), we also evaluated our model using the metric of policy quality. This is defined as the expected discounted reward achieved by following a softmax policy of the value predictions. Policy quality is normalized such that an optimal policy has a score of 1 and a uniform random policy has a score of 0. Intuitively, policy quality is the true normalized expectation of score over all maps in the dataset, instructions per map, and start states per map-instruction pair. Our model outperforms both baselines on this metric as well on the test maps (Table 2). We also note that the relative performance of the baselines flips compared to their performance on the training maps (Figure 5). While the UVFA variant learned a better policy on the train set, it did not generalize to new environments as well as the CNN + LSTM.
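The two ingredients of policy quality, a softmax policy over value predictions and a normalization anchored at the random and optimal policies, can be sketched as follows. The linear rescaling and the temperature parameter are assumptions made for illustration; the text only states which policies map to 0 and 1.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Action probabilities proportional to exp(value / temperature)."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_quality(achieved: float, random_return: float,
                   optimal_return: float) -> float:
    """Rescale a return so that random play scores 0 and optimal play scores 1."""
    return (achieved - random_return) / (optimal_return - random_return)
```

Under this rescaling, a policy whose expected discounted return lies halfway between the random and optimal returns receives a policy quality of 0.5.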
Finally, given the nature of our environments, we can use the predicted value maps to infer a goal location by taking the position of the maximum value. We use the Manhattan distance from this predicted position to the actual goal location as a third metric. The accuracy of our model's goal predictions is more than twice that of the baselines on local references and roughly 45% better on global references.
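The goal localization metric can be sketched in a few lines: take the cell with the highest predicted value as the inferred goal and measure its Manhattan distance to the annotated goal. The function names and the toy value map are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def predicted_goal(value_map: List[List[float]]) -> Cell:
    """Cell (row, col) holding the maximum predicted value."""
    cells = [(r, c) for r in range(len(value_map)) for c in range(len(value_map[0]))]
    return max(cells, key=lambda rc: value_map[rc[0]][rc[1]])


def manhattan(a: Cell, b: Cell) -> int:
    """Manhattan (L1) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


toy_values = [[0.1, 0.4, 0.2], [0.3, 2.9, 0.5]]
localization_error = manhattan(predicted_goal(toy_values), (0, 2))
```

Averaging this error over all test instructions gives a single distance figure per model, with lower values indicating more precise goal predictions.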
Analysis of learned representations For the representation analysis in a supervised setting, we compared the predicted value maps of all models against the unseen test split of maps. Table 3 shows the results of this study. As expected, our model without the global gradient performs no differently from the full model on local references, but has higher MSE and average distances to true goal than the full model on global references. We also note that UVFA (pos) performs much worse than both CNN+LSTM and our model, showing the difficulty of environment generalization even when the goals are observed directly. (The original UVFA paper (Schaul et al., 2015) demonstrated effective generalization over goal states within a single environment.) Surprisingly, our model trained via reinforcement learning has more precise goal location predictions (as measured via Manhattan distance) than when trained on true state values in a supervised manner. However, the MSE of the value predictions is much higher in the RL setting (e.g., 0.80 vs 0.25 for supervised on local instructions). This shows that despite the comparative stability of the supervised setting, minimization of value prediction error does not necessarily lend itself to the best policy or goal localization. Conversely, having a higher MSE does not always imply a worse policy, as seen also in the performance of the two UVFA variants in Table 3.
Generalization One of the criteria laid out for our model was its ability to construct language representations and produce accurate value maps, independent of layouts and linguistic variation. Figure 6 provides examples of two layouts, each with two different instructions. In the first map (top), we have both instructions referring to the same location. Our model is able to mimic the optimal value map accurately, while the other baselines are not as precise, either producing a large field of possible goal locations (CNN+LSTM) or completely missing the goal (UVFA-text).
On the vertical axis, we observe generalization across different maps with the same instructions. Our model is able to precisely identify the goals in each scenario in spite of significant variation in their locations. This proves harder for the other representations.
Although our model is compositional in the sense that it transfers knowledge of spatial references between different environments, some types of instructions do prove challenging. We identify two of the poorest predictions in Figure 7. We see that multiple levels of indirection (as in 7a, which references a location relative to an object relative to another object) or unnecessarily long instructions (as in 7b, which uniquely identifies a position by the eighth token but then proceeds with redundant information) are still a challenge.
Learning curve Due to the manual effort that comes with constructing a dataset of human annotations, it is also important to consider the sample efficiency of a model. Figure 8 shows the policy quality and prediction error on local instructions as a function of training set size. Our model reaches 0.90 policy quality with only 400 samples, demonstrating efficient generalization capability.
Table 4: [...] (Bisk et al., 2016). Our model once again outperforms the baselines, although all models have a lower policy quality on this dataset than on our own.

ISI Grounding Dataset
We also evaluate our model on the ISI Language Grounding dataset (Bisk et al., 2016), which contains human-annotated instructions describing how to arrange blocks identified by numbers and logos.
Although it does not contain variable environment maps as in our dataset, it has a larger action space and vocabulary. The caveat is that the task as posed in the original dataset is not compatible with our model. For a policy to be derived from a value map with the same dimension as the state observation, it is implicitly assumed that there is a single controllable agent, whereas the ISI set allows multiple blocks to be moved. We therefore modify the ISI setup using an oracle to determine which block is given agency during each step. This allows us to retain the linguistic variability of the dataset while overcoming the mismatch in task setup. The states are discretized to a 13 × 13 map and the instructions are lemmatized. Performance on the modified ISI dataset is reported in Table 4 and representative visualizations are shown in Figure 9. Our model outperforms both baselines by a greater margin in policy quality than on our own dataset. Misra et al. (2017) also use this dataset and report results in part by determining the minimum distance between an agent and a goal during an evaluation lasting N steps. This evaluation metric is therefore dependent on this timeout parameter N. Because we discretized the state space so as to be able to represent it as a grid of embeddings, the notion of a single step has been changed and direct comparison limited to N steps is ill-defined.⁶ Hence, due to modifications in the task setup, we cannot compare directly to the results in Misra et al. (2017).
⁶ When a model is available and the states are not overwhelmingly high-dimensional, policy quality is a useful metric that is independent of this type of parameter. As such, it is our default metric here. However, estimating policy quality for environments substantially larger than those investigated here is a challenge in itself.
Figure 9 (partial caption): The MSE of the value map prediction as a function of a subgoal's ordering in an overall task. The model performs better on subgoals later in a task despite the subgoals being treated completely independently during both training and testing.
Understanding grounding evaluation An interesting finding in our analysis was that the difficulty of the language interpretation task is a function of the stage in task execution (Figure 9(d)). In the ISI Language Grounding set (Bisk et al., 2016), each individual instruction (describing where to move a particular block) is a subgoal in a larger task (such as constructing a circle with all of the blocks). The value maps predicted for subgoals occurring later in a task are more accurate than those occurring early in the task. It is likely that the language plays a less crucial role in specifying the subgoal position in the final steps of a task. As shown in Figure 9(a), it may be possible to narrow down candidate subgoal positions just by looking at a nearly-constructed high-level shape. In contrast, this would not be possible early in a task because most of the blocks will be randomly positioned. This finding is consistent with a result from Branavan et al. (2011), who reported that strategy game manuals were useful early in the game but became less essential further into play. It appears to be part of a larger trend that the marginal benefit of language in such grounding tasks can vary predictably between individual instructions.

Conclusions
We have described a novel approach for grounded spatial reasoning. Combining the language representation in a spatially localized manner allows for increased precision of goal identification and improved performance on unseen environment configurations. Alongside our models, we present Puddle World Navigation, a new grounding dataset for testing the generalization capacity of instruction-following algorithms in varied environments.