Policy Optimization for Reinforcement Learning Beyond Cumulative Rewards
Abstract
The goal of this project is to establish the optimization theory and develop sample-efficient algorithms for reinforcement learning (RL) with general objectives that go beyond a cumulative sum of rewards. Such problems arise in control and sequential decision-making applications that involve dynamic resource allocation, exploration in unknown environments, safety constraints, and imitation learning. Many of these problems cannot be solved using known RL methods, and we aim to address that challenge. We consider policy optimization in Markov Decision Processes where the objective is a general concave utility function of the long-term state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Because this generality invalidates the Bellman equation, dynamic programming no longer applies, and we focus on direct policy optimization. We have three specific aims. Firstly, we will investigate the computation and estimation of the policy gradient for general utilities and general policy parametrizations, a setting in which the standard Policy Gradient Theorem no longer holds. We will establish that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem, and we will develop a set of variational Monte Carlo gradient estimation algorithms to compute the policy gradient based on sample paths. Secondly, we will investigate the global convergence of policy gradient-based algorithms for RL with general utilities, and in particular we will establish sample complexity bounds for learning the optimal policy with these methods. Further, we will exploit the problem's hidden convex nature to develop accelerated algorithms that are generalizable. Thirdly, we will apply and empirically validate our methods in specific use cases, one example being the optimal network intervention problem for mitigating potential pandemics in a community.
This project will generate a set of deployable methods and theoretical results, which will enable policy optimization in practical RL systems for a broader class of complex tasks.
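To make the objective concrete, the following is a minimal sketch and not the project's method. It assumes a small tabular MDP with a known transition kernel, a softmax policy, and the entropy of the occupancy measure as the concave utility (an exploration-style objective). Under those assumptions the gradient of the utility follows from the chain rule: the partial derivatives of the utility with respect to the occupancy measure act as a pseudo-reward, and the ordinary policy gradient is evaluated for that pseudo-reward. All function and variable names here are hypothetical and chosen for illustration. The variational Monte Carlo estimators described in the abstract target the sample-based setting, which this exact model-based computation does not cover.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] = probability of action a in state s."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def occupancy(theta, P, mu, gamma):
    """Normalized discounted state-action occupancy measure lam[s, a]."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    n_states = P.shape[0]
    # State occupancy solves d = (1 - gamma) * mu + gamma * P_pi^T d.
    d = (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
    return d[:, None] * pi

def entropy_utility(lam):
    """Concave utility F(lam) = -sum lam * log(lam)."""
    return -np.sum(lam * np.log(lam))

def utility_gradient(theta, P, mu, gamma):
    """Gradient of F(lam(theta)) w.r.t. theta via the chain rule."""
    pi = softmax_policy(theta)
    lam = occupancy(theta, P, mu, gamma)
    r = -(np.log(lam) + 1.0)               # pseudo-reward dF/dlam
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                # shape (S, A)
    # Softmax policy gradient for the fixed pseudo-reward r.
    return lam * (Q - V[:, None])

def finite_difference_gradient(theta, P, mu, gamma, eps=1e-6):
    """Central finite differences of F(lam(theta)), used only as a check."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        up = entropy_utility(occupancy(theta + step, P, mu, gamma))
        down = entropy_utility(occupancy(theta - step, P, mu, gamma))
        grad[idx] = (up - down) / (2.0 * eps)
    return grad

# Hypothetical random MDP with 3 states and 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)          # each P[s, a, :] is a distribution
mu = np.full(S, 1.0 / S)                   # uniform initial distribution
theta = rng.normal(size=(S, A))

grad = utility_gradient(theta, P, mu, gamma)
fd_grad = finite_difference_gradient(theta, P, mu, gamma)
print("max abs difference:", np.max(np.abs(grad - fd_grad)))
```

The finite-difference comparison is only a sanity check on the chain-rule computation. Note that the pseudo-reward depends on the current occupancy measure, so it must be recomputed whenever the policy changes; this dependence is why a fixed-reward policy gradient estimator cannot be applied directly to a general utility.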
Document Details
- Document Type: DoD Grant Award
- Publication Date: Apr 06, 2021
- Source ID: N000142112288
Entities
People
- Mengdi Wang
Organizations
- Office of Naval Research
- Trustees of Princeton University
- United States Navy