Policy Optimization for Reinforcement Learning Beyond Cumulative Rewards

Abstract

The goal of this project is to establish the optimization theory and develop sample-efficient algorithms for reinforcement learning (RL) with general objectives that go beyond a cumulative sum of rewards. Such problems arise in control and sequential decision-making applications that involve dynamic resource allocation, exploration in unknown environments, safety constraints, and imitation learning. Many of these problems cannot be solved using known RL methods, and we aim to address this challenge. We consider policy optimization in Markov Decision Processes where the objective is a general concave utility function of the long-term state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation, so dynamic programming no longer applies, and we therefore focus on direct policy optimization. We have three specific aims. First, we will investigate the computation and estimation of the policy gradient for general utilities and general policy parametrizations; note that the standard Policy Gradient Theorem no longer holds. We will establish that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem, and we will develop a set of variational Monte Carlo gradient estimation algorithms to compute the policy gradient based on sample paths. Second, we will investigate the global convergence of policy gradient-based algorithms for RL with general utilities, and in particular we will establish sample complexity bounds for learning the optimal policy with these methods. Further, we will exploit the problem's hidden convex nature to develop accelerated algorithms that are generalizable. Third, we will apply and empirically validate our methods in specific use cases; one example is the optimal network intervention problem for mitigating potential pandemics in a community.
This project will generate a set of deployable methods and theoretical results, which will enable policy optimization in practical RL systems for a broader class of complex tasks.
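To make the objective class concrete, the following is a minimal sketch (an illustration only, not the project's algorithms) that computes the state-action occupancy measure of a fixed policy in a tiny tabular MDP and evaluates two utilities on it. The discounted, normalized form of the occupancy measure, the truncation horizon, the toy transition numbers, and all function and variable names are assumptions made for this example.

```python
import math

def occupancy_measure(P, pi, mu0, gamma, horizon=2000):
    """Normalized discounted state-action occupancy, truncated at `horizon` steps.

    P[s][a][s2] is a transition probability, pi[s][a] a policy probability,
    and mu0[s] the initial state distribution (all hypothetical inputs).
    """
    n_s, n_a = len(P), len(P[0])
    lam = [[0.0] * n_a for _ in range(n_s)]
    d = list(mu0)            # state distribution at the current time step
    weight = 1.0 - gamma     # (1 - gamma) * gamma^t, so entries sum to about 1
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = d[s] * pi[s][a]
                lam[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        d = nxt
        weight *= gamma
    return lam

def linear_utility(lam, r):
    """Cumulative-reward objective: linear in the occupancy measure."""
    return sum(lam[s][a] * r[s][a]
               for s in range(len(lam)) for a in range(len(lam[0])))

def entropy_utility(lam):
    """Exploration-style objective: entropy of the occupancy measure (concave)."""
    return -sum(x * math.log(x) for row in lam for x in row if x > 0.0)

# Toy 2-state, 2-action MDP (numbers chosen arbitrarily for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
pi_uniform = [[0.5, 0.5], [0.5, 0.5]]
mu0 = [1.0, 0.0]
gamma = 0.9
r = [[1.0, 0.0], [0.0, 2.0]]

lam = occupancy_measure(P, pi_uniform, mu0, gamma)
print("linear utility :", linear_utility(lam, r))
print("entropy utility:", entropy_utility(lam))
```

The linear utility is the usual discounted return (scaled by 1 - gamma), so it fits the cumulative-reward setting. The entropy utility is nonlinear in the occupancy measure and cannot be written as a sum of per-step rewards, which is the kind of objective the abstract refers to when it says the Bellman equation no longer applies.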

Document Details

Document Type
DoD Grant Award
Publication Date
Apr 06, 2021
Source ID
N000142112288

Entities

People

  • Mengdi Wang

Organizations

  • Office of Naval Research
  • Trustees of Princeton University
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Adaptive Control and Estimation with Uncertainty in Dynamic Systems
  • Neural Network Machine Learning
  • Operations Research

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms