Adapting to Unseen Environments through Explicit Representation of Context

In order to deploy autonomous agents to domains such as autonomous driving, infrastructure management, health care, and finance, they must be able to adapt safely to unseen situations. The current approach in constructing such agents is to try to include as much variation into training as possible, and then generalize within the possible variations. This paper proposes a principled approach where a context module is coevolved with a skill module. The context module recognizes the variation and modulates the skill module so that the entire system performs well in unseen situations. The approach is evaluated in a challenging version of the Flappy Bird game where the effects of the actions vary over time. The Context+Skill approach leads to significantly more robust behavior in environments with previously unseen effects. Such a principled generalization ability is essential in deploying autonomous agents in real world tasks, and can serve as a foundation for continual learning as well.


INTRODUCTION
Generalization to unseen situations is an important capability for autonomous agents. Especially in real-world decision making and control applications such as autonomous driving, robotics, process control, health care, and finance, the agents routinely need to adapt safely to unseen situations. A common practice is to train these models, mostly deep neural networks, with the data collected from a limited number of hand-designed scenarios. However, the tasks are often too complex to anticipate every possible scenario, and this approach is not scalable. Moreover, these models can be brittle when they are exposed to even small variations or noise.
One popular approach to address this problem is few-shot learning, in particular metalearning, either by utilizing gradients [7,21,25] or evolutionary procedures [6,10]. In metalearning, systems are trained by exposing them to a large number of tasks, and then tested for their ability to learn new related but unseen tasks. There are also a number of approaches, mostly for the supervised learning setting, where new labels need to be predicted based on a limited amount of training data. However, applications in control and decision making, including reinforcement learning problems, are very limited [14].
The approach in this paper is motivated by prior work on opponent modeling in poker [16,17]. In that domain, an effective approach was to evolve one neural network, the game module, to decide what move to make, and another, the opponent module, to monitor the opponent and modulate those decisions by taking the opponent's playing style into account. When trained with only a small number of very simple but different opponents, the approach was able to generalize and play well against a wide array of opponents, including some that were much better than anything seen during training.
In a sense, the opponent forms a context for the decision making in poker. Each decision needs to take into account how the opponent is likely to respond, and select the right action accordingly. The player can thus adapt to many different game playing situations immediately, even those that have not been encountered before. In this paper, this approach is generalized and applied to control and decision making more broadly. In more general terms, a skill network reacts to the current situation, and a context network integrates observations over a longer time period. A third network, the controller, combines the outputs of both networks, thus modulating decision making through context. Such a Context+Skill system can thus generalize to more situations than any of its components alone.
The Context+Skill approach is evaluated in this paper on an extended version of the popular Flappy Bird game. This version includes more actions and physical effects (i.e., forward flap and drag in addition to flap up and gravity). Such an extension allows generating a range of unseen scenarios both by extending the range of effects of those actions and by combining them. The approach generalizes remarkably well to new situations, and does so much better than its components alone. The Context+Skill approach is thus a promising approach for building robust autonomous agents in real-world domains.
The remaining sections of this paper are organized as follows: Section 2 presents the experimental setup and the test domain, the architecture of the neural networks, and the multiobjective evolution procedure for constructing the system. Section 3 presents learning and generalization results, demonstrating that the Context+Skill approach indeed performs better than its parts. The behaviors of these networks are contrasted in Section 4, finding that Context+Skill can anticipate results of its actions more accurately, making it possible to adapt to unseen situations.

METHODOLOGY
This section introduces the experimental setup, the neural networks used as the control policies for the agent, and the evolutionary training methodology.
Figure 1: A scene from the Flappy Ball game. The red circle represents the agent and the white columns are pipes that move from right to left as the game progresses. At each time step, the agent can flap up or forward or both; if it does not, gravity will pull it down and drag will slow it down. The origin of the coordinate system (0,0) is at the upper left corner; thus the action of flapping up results in a negative y-velocity or a smaller y-position. The six variables identified in the figure constitute the input information that the agent receives at each time step. This domain is more complex than the common Flappy Bird game, which does not have the forward flap action or drag. In order to pass more pipes without a collision, the agent needs to use the forward flapping action carefully because the only way to slow down is through drag. It also needs to be cautious because it can only observe the closest pipe. By changing the effects of actions and the forces of gravity and drag, new and more challenging situations can be created, testing the generalization performance of the agent's control policy.

The Flappy Ball Domain
Flappy Ball is an extension of the popular Flappy Bird computer game [1]. Implemented in PyGame, it has less detailed visual effects but more complex physical dynamics, and is mainly developed to test the generalization behavior of an agent in a more challenging and controlled environment (Fig. 1). The agent, controlled by a neural network, aims to navigate through the openings between pipes without hitting them for a certain length of time. The agent can control two actions, i.e., flapping forward and upward; both actions can be applied simultaneously. If they are not applied, gravity will pull the agent down and drag will slow it down. The agent gets a reward of +1 every time it passes a pipe successfully, and various penalties depending on how badly it crashes into the pipes, ceiling, or ground. Each time step spent in a collision with a pipe incurs a penalty of -1, and each time step spent hitting the ceiling or the ground incurs a penalty of -5. This way, attempting to fly through the pipes but failing is penalized less than not trying at all, i.e., flying into the ceiling or the ground.
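The reward scheme described above can be sketched as a small function. This is only an illustration: the function name, the boolean flags, and the assumption that the three terms simply add up within a single time step are hypothetical and not taken from the Flappy Ball implementation.

```python
def step_reward(passed_pipe: bool, in_pipe_collision: bool, hit_boundary: bool) -> float:
    """Reward for one time step, using the values quoted in the text.

    Assumption: the terms are additive within a time step.
    """
    reward = 0.0
    if passed_pipe:
        reward += 1.0   # +1 for every successfully passed pipe
    if in_pipe_collision:
        reward -= 1.0   # -1 per time step spent colliding with a pipe
    if hit_boundary:
        reward -= 5.0   # -5 per time step spent on the ceiling or the ground
    return reward
```

With these values, an agent that brushes a pipe for one time step loses 1 point, whereas one that rests on the ground for the same duration loses 5.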
At every time step, the agent receives sensory information as a vector of six numerical values as indicated in Fig. 1: the vertical position of the agent (y), its horizontal and vertical velocities (v_x and v_y, respectively), the horizontal distance of the agent to the right edge of the closest pipe (x), and the height of the top and bottom pipes (h_top and h_bottom, respectively). These values are normalized to the range [0,1]. In an environment with known physical effects, this setup is a Markov Decision Process (MDP) since all the state information necessary to decide on the right action is provided to the agent. However, the effect of the agent's actions, i.e., flap up or forward, as well as the physical forces acting upon the agent, i.e., gravity and drag, can change between episodes unbeknownst to the agent, establishing a new task for the agent. Therefore, in order to perform well in new tasks, the agent has to infer such variations from its interactions with the environment over time, which makes the problem partially observable. Since acceleration and velocity are linearly correlated, such learning is possible.
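To illustrate why the hidden parameters can in principle be inferred from observations, the sketch below recovers them from a velocity trace by least squares. It assumes a simple additive model, v_y[t+1] - v_y[t] = gravity + flap * a_up[t], which is a hypothetical stand-in; the actual Flappy Ball dynamics are not specified in the text, and the Context module learns such regularities implicitly rather than by explicit regression.

```python
import numpy as np

def estimate_vertical_params(v_y, flap_actions):
    """Least-squares estimate of (gravity, flap) under the assumed additive model."""
    v_y = np.asarray(v_y, dtype=float)
    a_up = np.asarray(flap_actions, dtype=float)
    dv = np.diff(v_y)                          # observed change in vertical velocity
    X = np.column_stack([np.ones_like(dv),     # column for the gravity term
                         a_up[:len(dv)]])      # column for the flap term
    coef, *_ = np.linalg.lstsq(X, dv, rcond=None)
    return float(coef[0]), float(coef[1])

# Synthetic trace generated with parameters the estimator does not know.
true_gravity, true_flap = 1.2, -9.6
actions = [1 if t % 5 == 0 else 0 for t in range(50)]
v_y = [0.0]
for a in actions:
    v_y.append(v_y[-1] + true_gravity + true_flap * a)

est_gravity, est_flap = estimate_vertical_params(v_y, actions)
```

Because the velocity change is linear in the unknown parameters, a handful of time steps containing both flapping and gliding is enough to identify them in this simplified setting.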
The Flappy Ball domain can be seen as a proxy for control and decision making problems where the changes in the environment require immediate adaptation, such as operating a vehicle under different weather conditions, configuration changes, wear and tear, or sensor malfunctions.The challenge is to adapt the existing policies to the new conditions immediately without further training, i.e. to generalize the known behavior to unseen situations.

Evolutionary Multi-objective Optimization (EMO)
The original Flappy Bird game is usually treated as a single-objective optimization problem, where the number of pipes passed until one is hit is maximized. To provide for more varied behaviors, Flappy Ball is formulated as a multi-objective optimization problem instead.
The number of successfully passed pipes (f_p) is maximized, whereas the number of collisions of any type (f_h, where h stands for hits) is minimized.
The Non-dominated Sorting Genetic Algorithm (NSGA-II) [5], implemented in DEAP [8], was used as the EMO method for Flappy Ball. Although finding the safest solution (f_h = 0) is the ultimate goal, as in the single-objective case, the diversity resulting from the multiobjective search speeds up training and helps discover well-performing solutions [15]. EMO algorithms use Pareto dominance to sort the solutions into sets of equally preferable solutions (or Pareto fronts). The front containing the non-dominated solutions is called the Pareto-optimal set [4]; it is up to the user to select one of its solutions based on their needs. In the experiments in this paper, one that is perfectly safe or close to it is usually selected.
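As a minimal sketch of Pareto dominance for the two objectives used here, the following code extracts the non-dominated set from a list of (pipes, hits) pairs. The function names and the example values are illustrative only; NSGA-II itself additionally sorts the remaining solutions into subsequent fronts.

```python
def dominates(a, b):
    """True if a dominates b; each is (pipes, hits): pipes maximized, hits minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective values for four candidate solutions.
candidates = [(22.0, 0.0), (25.0, 0.5), (20.0, 0.2), (25.0, 1.0)]
front = pareto_front(candidates)
# front == [(22.0, 0.0), (25.0, 0.5)]
```

In this example the perfectly safe solution (22.0, 0.0) and the higher-scoring but less safe (25.0, 0.5) are both Pareto-optimal; choosing between them is the user's decision described above.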

Neural Networks
The Context+Skill Network consists of three components: the Skill and the Context modules and the Controller (Figure 2). The first two modules receive sensory information from the environment as numerical values, as described in Section 2.1. They send their output to the Controller, a fully connected feedforward neural network that makes the decisions on which actions to take.
The Skill module is also a fully connected feedforward network. Together with the Controller, it forms the Skill-only Network, S (Fig. 2(c)). The Skill module used in this study has 10 hidden and [...] Thus, the cell can learn to retain information from the past, update it, and output it at an appropriate time, thereby making it possible to learn sequential behavior [9,11].
The C-module used in this study consists of an LSTM cell of size 10. The memory of the C-module (h_{t-1} and c_{t-1}) is reset at the beginning of each new task, and accumulated (transferred) across episodes within each task. It can therefore form a representation of how actions affect the environment. The output of the LSTM (h_t) is sent to the Controller as the context. Together the C-module and the Controller form the Context-only network, C (Fig. 2(b)). It serves as a second baseline, allowing integration of observations over time, but without a specific Skill network to map them directly to action recommendations.
The complete Context+Skill Network, CS (Fig. 2(a)), consists of both the Context and Skill modules as well as a Controller network of the same size as in C and S. The motivation behind the CS architecture, i.e., of integrating the Context module into S, is to make it possible for the system to learn to use an explicit context representation to modulate its actions appropriately. The method for discovering these behaviors is discussed next.
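The data flow of the CS architecture can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: only the six inputs, the 10 hidden units of the Skill module, and the LSTM size of 10 come from the text, while the Skill output size, the Controller hidden size, the activation functions, the 0.5 action threshold, and the random weights (which would be evolved in practice) are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ContextSkillSketch:
    """Skill (feedforward) + Context (LSTM) feeding a feedforward Controller."""

    def __init__(self, n_in=6, skill_hidden=10, skill_out=5, lstm_size=10,
                 ctrl_hidden=10, n_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        # Skill module: reacts to the current observation only.
        self.Ws1, self.bs1 = init(skill_hidden, n_in), np.zeros(skill_hidden)
        self.Ws2, self.bs2 = init(skill_out, skill_hidden), np.zeros(skill_out)
        # Context module: one LSTM cell, the four gate blocks stacked row-wise.
        self.Wl = init(4 * lstm_size, n_in + lstm_size)
        self.bl = np.zeros(4 * lstm_size)
        # Controller: combines the Skill output with the context h_t.
        self.Wc1 = init(ctrl_hidden, skill_out + lstm_size)
        self.bc1 = np.zeros(ctrl_hidden)
        self.Wc2, self.bc2 = init(n_actions, ctrl_hidden), np.zeros(n_actions)
        self.lstm_size = lstm_size
        self.reset_context()

    def reset_context(self):
        """Clear the LSTM memory (h and c), as done at the start of each task."""
        self.h = np.zeros(self.lstm_size)
        self.c = np.zeros(self.lstm_size)

    def act(self, obs):
        obs = np.asarray(obs, dtype=float)
        skill = np.tanh(self.Ws2 @ np.tanh(self.Ws1 @ obs + self.bs1) + self.bs2)
        z = self.Wl @ np.concatenate([obs, self.h]) + self.bl
        i, f, o, g = np.split(z, 4)            # input, forget, output gates; candidate
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)  # context passed to the Controller
        hidden = np.tanh(self.Wc1 @ np.concatenate([skill, self.h]) + self.bc1)
        out = sigmoid(self.Wc2 @ hidden + self.bc2)
        return out > 0.5                       # (flap up, flap forward)

net = ContextSkillSketch()
action = net.act(np.full(6, 0.5))
```

Dropping the Context branch from `act` yields the Skill-only ablation S, and dropping the Skill branch yields the Context-only ablation C.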

Neuroevolution
All three neural network models described in Section 2.3 are evolved using NSGA-II [5]. The overall procedure is shown in Algorithm 3. The network architectures remain fixed while their weights are evolved. The goal is to maximize their average score across multiple tasks, where each task is based on different physical parameters of the Flappy Ball environment. The base values for the four parameters are chosen as Flap_base = -12.0 (the negative value is due to the coordinate system), Gravity_base = 1.0, Fwd_base = 5.0, and Drag_base = 1.0. In each task during evolution, only one parameter is subject to change, while the rest are fixed at their base values. There are four tasks used in evolution, defined as:
• Task-1: The effect of the Flap action varies ±20% of its base value, i. [...]
Every individual in the population is evaluated in parallel on the same task distribution for a fair comparison. The length of each episode is fixed to 500 time steps. The seed for the random number generator (Line 3 in Algorithm 1) is included in the task parameters so that the distribution of the pipes can be repeated.
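A possible way to prepare such task parameters is sketched below. Several details are assumptions rather than facts from the text: that the four tasks vary Flap, Gravity, Fwd, and Drag in that order, that the varied value is drawn uniformly within ±20% of its base, and that a fresh value and pipe seed are drawn for every episode. The names `make_tasks` and `pipe_seed` are hypothetical.

```python
import random

BASE = {"flap": -12.0, "gravity": 1.0, "fwd": 5.0, "drag": 1.0}

def make_tasks(base=BASE, perturb=0.2, n_episodes=5, seed=0):
    """Return one list of episode parameter dicts per task."""
    rng = random.Random(seed)
    tasks = []
    for name in base:                       # one task per varied parameter
        episodes = []
        for _ in range(n_episodes):
            params = dict(base)             # all other parameters stay at base
            params[name] = base[name] * (1.0 + rng.uniform(-perturb, perturb))
            params["pipe_seed"] = rng.randrange(2**31)  # repeatable pipe layout
            episodes.append(params)
        tasks.append(episodes)
    return tasks

tasks = make_tasks()
```

Generating the tasks once per generation and passing the same list to every individual keeps the comparison between individuals fair, as described above.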
After the task parameters are prepared, fitness evaluation follows (Algorithm 2). The parameters of a network are stored as an array in the individual candidate and converted to the corresponding neural network representation (Line 3) before the fitness evaluation (Line 9). The memory of the C-module in CS is reset at the beginning of each task (Lines 5-7), and transferred from episode to episode otherwise. The average number of successfully passed pipes and the average number of collisions per episode are returned as the two objective values to be maximized and minimized, respectively. There are a total of 20 episodes, since there are four tasks with five episodes in each.
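The evaluation loop described in this paragraph can be sketched as follows. The `evaluate` function, the `reset_context` method, and the `run_episode` callback are hypothetical names; the stub agent and fake episode exist only to make the sketch runnable and do not simulate the game.

```python
from statistics import mean

def evaluate(agent, tasks, run_episode):
    """Average pipes passed and hits over all episodes of all tasks."""
    pipes, hits = [], []
    for episodes in tasks:
        agent.reset_context()              # memory is reset at the start of each task
        for params in episodes:            # and carried over between its episodes
            f0, f1 = run_episode(agent, params)
            pipes.append(f0)
            hits.append(f1)
    return mean(pipes), mean(hits)

class _CountingAgent:
    """Stub agent that only counts how often its memory is reset."""
    def __init__(self):
        self.resets = 0
    def reset_context(self):
        self.resets += 1

def _fake_episode(agent, params):
    """Stand-in for a 500-step game episode; returns (pipes, hits)."""
    return params["pipes"], params["hits"]

demo_tasks = [[{"pipes": 20, "hits": 1}] * 5 for _ in range(4)]
demo_agent = _CountingAgent()
pipes_mean, hits_mean = evaluate(demo_agent, demo_tasks, _fake_episode)
```

With four tasks of five episodes each, the loop runs 20 episodes and resets the memory four times.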
The overall procedure, i.e., NSGA-II applied to evolving agents in the Flappy Ball domain, is shown in Algorithm 3. It receives n_tasks = 4, n_episodes = 5, perturb = 0.2 (i.e., ±20%), Flap_base = -12.0, Gravity_base = 1.0, Forward_base = 5.0, Drag_base = 1.0, µ = 96, p_crossover = 0.9, and n_gen = 2,500 as input. The population size (µ) is chosen as a multiple of 24 since the fitness evaluations are distributed among 24 threads on a cluster (i.e., Dell PowerEdge M710, 2x Xeon X5675, 6 cores @ 3.06 GHz). The details about the genetic operators such as SBX (Simulated Binary Crossover), Polynomial Mutation, and Tournament Selection Based on Dominance can be found in the literature [5]. NSGA-II uses a (µ + λ) elitist selection strategy with a bias toward individuals in lower fronts, where the Pareto-optimal front is the first front. If the individuals are located in the same front, the ones that are more distant from the others in objective space are selected to maintain a diverse set of trade-off solutions within the population.
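The within-front preference for isolated individuals is realized in NSGA-II through the crowding distance. The sketch below is a plain-Python version of that measure for illustration; it is not the DEAP code used in the experiments, and the example objective values are made up.

```python
def crowding_distance(front):
    """Crowding distance for a list of objective tuples from one Pareto front."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(front[0])):                 # loop over objectives
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue                               # no spread in this objective
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)      # normalized neighbor gap
    return dist

# Hypothetical (pipes, hits) values of four solutions in the same front.
same_front = [(20.0, 0.0), (22.0, 0.1), (23.0, 0.2), (30.0, 1.0)]
distances = crowding_distance(same_front)
# distances is approximately [inf, 0.5, 1.7, inf]
```

Individuals with larger distances lie in sparser regions of objective space and are preferred when a front has to be truncated.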

RESULTS
Evolution as given in Algorithm 3 was run separately for CS, C, and S until an individual was found that achieved fitness scores of at least f_0 = 22.0 and at most f_1 = 0.01, where f_0 is the average number of successfully passed pipes and f_1 is the average number of collisions. Although the final Pareto-optimal set in each run contained individuals with higher f_0 values, the f_1 requirement meant that only relatively safe solutions were accepted. The generalization ability of these solutions was then evaluated.

Learning
The evolution of S takes the fewest generations since it has the fewest model parameters to optimize, i.e., 287, compared with 982 for C and 1207 for CS. To make sure the number of parameters was not a factor, another S with a larger Skill module, with the same number of parameters as CS, was also evolved to the same target level. However, it performed poorly compared to the smaller S in the generalization studies, apparently because it overfit more easily. Thus, it was excluded from the comparisons that follow.

Generalization Behavior
To evaluate the generalization performance of the best performing networks, the task parameters (i.e., flap, gravity, forward, and drag) were changed in the following two ways while keeping the networks fixed:
• The range of variation in the task parameters was increased from 20% to 75%; and
• All four parameters were varied simultaneously as opposed to one at a time.
The task parameters were varied in a four-dimensional structured grid ranging from 25% to 175% of each parameter's base value. Thus, with the updated limits, the effect of [...] Each parameter axis was divided into 10 equal steps, and each set of task parameters was sampled three times (with varying pipe distributions) and averaged. Therefore, all three networks were tested for 3 × 10^4 episodes. To compare the generalization performance of the networks pairwise, the differences in the number of successfully passed pipes and in the number of collisions are presented in the histograms of Fig. 4. The horizontal axis shows the difference in either f_0 or f_1, whereas the vertical axis shows the frequency of these results. A distribution skewed to the right of the 0-value is better for the left histogram (i.e., score of pipes), whereas the opposite is better for the right histogram (i.e., score of hits) for each network.
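The size of this test set can be checked with a short sketch that enumerates the grid. It assumes that "10 equal steps" means 10 evenly spaced values per axis including both end points (25% and 175%), which is consistent with the reported 3 × 10^4 episodes but is not stated explicitly; the function and key names are hypothetical.

```python
import itertools
import numpy as np

BASE = {"flap": -12.0, "gravity": 1.0, "fwd": 5.0, "drag": 1.0}

def generalization_grid(base=BASE, lo=0.25, hi=1.75, steps=10, repeats=3):
    """Yield (task parameters, pipe seed) for every test episode."""
    scales = np.linspace(lo, hi, steps)        # assumed: end points included
    names = list(base)
    for combo in itertools.product(scales, repeat=len(names)):
        params = {n: base[n] * s for n, s in zip(names, combo)}
        for pipe_seed in range(repeats):       # three pipe layouts per setting
            yield params, pipe_seed

n_test_episodes = sum(1 for _ in generalization_grid())
# n_test_episodes == 30000, i.e. 3 x 10^4
```

Since all four parameters vary at once, most grid points combine values that never co-occurred during evolution, where only one parameter was perturbed at a time.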
The histograms show that CS performs better than both C and S by a large margin (Fig. 4(a) and (b)). Interestingly, C and S generalize similarly even though they have very different architectures (Fig. 4(c)). These results are also evident in the summary boxplot of Fig. 5. Therefore, even though neither C nor S performs well alone, when combined into CS, they work well together and allow generalization to a wide range of new situations.

BEHAVIOR ANALYSIS
To understand how the CS architecture outperforms its individual components C and S, a set of task parameters [Flap=-7.0, Gravity=0.6, Fwd=8.8, Drag=0.6], which was included in the generalization tests in Section 3, was evaluated further. This setting has previously unseen exaggerated effects for flap and forward, and previously unseen diminished effects for gravity and drag. Thus, actions tend to push up and speed up the agent more than expected, and it is difficult for it to slow down and come down. Generalization requires both extrapolation beyond the task parameter limits and an understanding of previously unseen interactions between the parameters. All three networks were tested in the same environment and their behavior was tracked in detail.
The C network was able to pass 15 pipes successfully and collided with six pipes, whereas S performed slightly better by passing 16 pipes with five collisions. In contrast, CS managed to pass all 21 pipes without hitting any of them. Fig. 6 illustrates how different the strategy of CS is from those of C and S. Both C and S use all four actions (flap, forward, simultaneously flap and forward, or do nothing, i.e., glide), but CS never uses flap alone. That action simply lifts the agent up, which is rarely the optimal action in this environment, where it takes such a long time to come down. If it is necessary to go up, that is because the opening is high, and in that case it is more efficient to move forward as well.
As an illustration, Fig. 7 shows a situation at the 4th and 5th pipe. Both C and S make a similar mistake by flapping up and forward. They end up too high too fast, do not have enough time to come back down, and crash into the 5th pipe. In contrast, as soon as the 5th pipe becomes visible, CS refrains from both actions while there is enough time for the weaker gravity and drag to slow and pull down the agent, and it reaches the opening in the 5th pipe just fine.

DISCUSSION AND FUTURE WORK
The proposed Context+Skill approach adapts to unseen situations by representing context explicitly. Compared to its components, it has a remarkable ability to generalize to unseen situations. In this proof-of-concept study, the architecture of the neural network model has a fixed topology, which constrains the model's functionality. Evolution of the network topology together with its weights [23,24] would be a natural extension to this work.
Besides the architecture, the choice of the tasks used for training plays an important role in the generalization capability of the model.Therefore, one direction for future work is to investigate methodologies that can automatically design a curriculum, i.e., a set of new training tasks and a better order to learn them [13,18,20,22,26].
Another direction for future work is to look into the hidden layer patterns to see if any evidence can be found for the observed generalization capabilities [28] or representational capacity [3]. There is plenty of work on learned hierarchical representations in applications such as computer vision [27] and natural language understanding [2]; however, such work is still limited in reinforcement learning tasks.
Lifelong machine learning tries to mimic how humans and animals learn by accumulating the knowledge gained from past experience and using it to incrementally adapt to new situations [19].The generalization ability presented in this work can serve as a foundation for continual learning.It can provide an initial rapid adaptation to new situations upon which further learning can be based.How to convert generalization into a permanent ability in this manner is an interesting direction of future research.

CONCLUSION
Perhaps the main challenge in deploying artificial agents in the real world is that they are brittle: they can only perform well in situations for which they were trained. However, this paper demonstrates an alternative approach based on separating contexts from the actual skills. Context can then be used to modulate the actions in a systematic manner, significantly extending the range of unseen situations that can be handled. This principle was evaluated in a challenging version of the Flappy Bird game, and was shown to perform better than traditional training and general memory-based training. This Context+Skill approach should be useful in many control and decision making tasks in the real world.

Figure 2: The architecture of the Context+Skill network and its ablations. (a) The network consists of three components: a Skill module that processes the current situation, a Context module that integrates observations over the entire task, and a Controller that combines the outputs of both modules, thereby using context to modulate actions. This architecture is compared to (b) the context-only ablation and (c) the skill-only ablation in the experiments. Each component is found to play an important role, allowing the CS network to generalize much better than its ablations.

Figure 3: Vanilla LSTM cell [9]. The cell acts as a long-term memory store for activation values, which makes it possible to build sequence processing architectures based on it. The internal functionality is based on input, forget, and output gates that need to be learned to achieve the desired behavior. An LSTM cell is used in the Context module to integrate observations over multiple episodes.

Figure 4: Generalization differences between the Context+Skill network and its ablations. The x-axis shows the differences in generalization performance across the 3 × 10^4 test episodes for the three pairs of architectures. A distribution that is skewed to the right of the 0 line (blue dashed line) is better on the left histogram (showing f_0, or number of pipes), and one that is skewed to the left is better on the right histogram (showing f_1, or number of hits). CS generalizes much better than C (a) or S (b), which are about equal (c).

Figure 5: Summary of the generalization distributions. The data from Fig. 4 are organized into a boxplot so that the distributions for the different architectures can be compared more clearly. CS generalizes better than both C and S, which are rather similar. CS thus combines the abilities of both C and S for superior generalization.

Figure 6: Action selection of the three networks over time. This environment has enhanced effects for flap and forward, and diminished effects for gravity and drag. While (a) S and (b) C use all four possible actions, (c) CS refrains from using the flap action, which is rarely useful in this environment.