Off-Policy Evaluation for Grounded Simulation Learning
Abstract
Autonomous agents can be seen as continually executing a policy mapping the current state of the world to an available action (or distribution over actions). The effect of each action is governed by the environment's transition function, which maps a state and an action to a next state (or distribution over states). If both the policy and the transition function remain constant, an agent can learn about the effects of its policy over time based on direct experience. However, if either (or both) changes, then its past experience may not be directly relevant. The policy or transition function may each change for a variety of reasons. We are concerned with being able to reason about autonomous agents that learn their policies, transfer knowledge across tasks, and/or face environmental changes. Our proposed contributions include fundamental mathematical analyses, and are expected to be very broadly applicable. A particularly motivating use case is an agent that learns its policy in an imperfect simulator, with the aim of deploying it safely and effectively in the real world. In such scenarios, a critical core capability for robust autonomous agents is predicting the effects of a new policy being executed in a known environment, or of a known policy being executed in a new environment (or both), before ever having directly relevant experience.
With this motivation in mind, we propose to make significant advances in the theoretical and empirical understanding of what has come to be known as off-policy evaluation: evaluating the effects of one policy, the evaluation policy, based on data generated by another policy, the behavior policy. This scenario is closely related to that of keeping the policy fixed and changing the environment, for example when learning a policy in simulation and then testing in the real world, in which case we can similarly refer to an evaluation environment and a behavior environment.
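To make the off-policy evaluation setting concrete, the following is a minimal sketch of ordinary importance sampling, the standard baseline family that the abstract refers to as importance-sampling-based methods. It is not the method proposed in this award. The function name, the tabular policy arrays `pi_e` and `pi_b`, and the episode format are all assumptions chosen for illustration.

```python
import numpy as np


def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of the evaluation policy's return.

    Illustrative assumptions (not from the source):
      trajectories: list of episodes, each a list of (state, action, reward)
                    tuples collected while running the behavior policy
      pi_e, pi_b:   arrays of shape (n_states, n_actions) holding action
                    probabilities for the evaluation and behavior policies
    """
    estimates = []
    for episode in trajectories:
        weight = 1.0  # product of per-step probability ratios
        ret = 0.0     # discounted return observed in this episode
        for t, (s, a, r) in enumerate(episode):
            weight *= pi_e[s, a] / pi_b[s, a]
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

Each observed return is reweighted by how much more (or less) likely the evaluation policy was to produce that episode than the behavior policy. The estimate requires the behavior policy to assign nonzero probability to every action the evaluation policy might take.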
Our proposed research begins from our own recent advances in both the theory and practice of off-policy evaluation. In brief:
1. We have introduced a new theoretically grounded method for off-policy evaluation that is a form of model-based bootstrapping. The method predicts the effect of running the evaluation policy using multiple different possible environmental models, each of which is induced from a different subset of data generated by the behavior policy. We have shown our methods to generate tighter lower bounds on expected performance of the evaluation policy than existing importance-sampling-based methods, thus enabling better expected performance while maintaining safety constraints.
2. We have introduced a novel iterative methodology for learning a behavior in simulation for eventual deployment in the real world, called grounded simulation learning (GSL). This method uses machine learning methods to iteratively improve a simulator's transition function (the behavior environment) so that it more closely reflects that of the real world (the evaluation environment). Central to the methodology is a neural-network-based model for predicting the effects of a policy in the real world based on data from a simulator. In preliminary experiments, we have used GSL to generate the fastest known stable walk on a Softbank Nao bipedal robot.
Building on these initial encouraging results, we propose to:
A) Expand and broaden the theoretical results in 1), for example by analyzing what behavior policy is most efficient for generating data to evaluate a given evaluation policy.
B) Connect the theoretical results in 1) with the empirical results in 2), for example by constraining the policy learned in simulation to be safe with high confidence in the real world.
C) Expand and broaden the empirical results in 2), for example by repeating the methodology on more robots and more tasks, with the aim of making the GSL methodology more general and robust.
In doing so, we will both significantly increase the community's understanding of off-policy evaluation
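The model-based bootstrapping idea summarized in the abstract can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: it assumes a small tabular environment, resampling with replacement as the way to form "different subsets" of the behavior data, add-one smoothing for the transition model, and a simple percentile as the lower bound. All function and parameter names are hypothetical.

```python
import numpy as np


def fit_model(transitions, n_states, n_actions):
    """Fit a tabular model from (state, action, reward, next_state) tuples."""
    counts = np.ones((n_states, n_actions, n_states))  # add-one smoothing (assumption)
    reward_sum = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        visits[s, a] += 1
    P = counts / counts.sum(axis=2, keepdims=True)  # estimated transition probabilities
    R = reward_sum / np.maximum(visits, 1)          # mean reward; 0 where unvisited
    return P, R


def policy_value(P, R, pi_e, start_state, horizon):
    """Finite-horizon expected return of pi_e in the model (P, R)."""
    v = np.zeros(P.shape[0])
    for _ in range(horizon):
        q = R + P @ v               # action values, shape (n_states, n_actions)
        v = (pi_e * q).sum(axis=1)  # average over the evaluation policy's actions
    return v[start_state]


def bootstrap_lower_bound(transitions, pi_e, n_states, n_actions,
                          start_state=0, horizon=10, n_boot=200,
                          delta=0.05, seed=0):
    """Percentile-bootstrap lower bound on the evaluation policy's value."""
    rng = np.random.default_rng(seed)
    n = len(transitions)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample behavior data with replacement
        sample = [transitions[i] for i in idx]
        P, R = fit_model(sample, n_states, n_actions)
        values.append(policy_value(P, R, pi_e, start_state, horizon))
    return float(np.quantile(values, delta))
```

Each resampled data set induces a different model, and the spread of the evaluation policy's predicted value across those models gives an approximate lower bound at confidence level `1 - delta`. The guarantee is only as good as the model class, which is why this sketch should be read as an outline of the idea rather than a safe estimator.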
Document Details
- Document Type: DoD Grant Award
- Publication Date: Jul 10, 2018
- Source ID: N000141812243
Entities
People
- Peter Stone
Organizations
- Office of Naval Research
- United States Navy
- University of Texas at Austin