Approximate dynamic programming and reinforcement learning: Non-parametric methods and instance dependence

Abstract

There are various settings in which it is desirable to control the behavior of a stochastic system that evolves in a dynamic fashion over time, with the goal of maximizing the expected reward received. Such stochastic control problems have a wide range of applicat ions, among them robotics, competitive game-playing systems, circuit design and layout problems, resource allocation problems, indus trial process control, and control of chemical reactions. Such problems are tackled with a combination of techniques from control th eory, statistics, machine learning, and numerical methods. Recent years have witnessed dramatic empirical advances in approximate dy namic programming (ADP) and reinforcement learning (RL), but there are various theoretical questions that remain answered. This rese arch proposal tackles fundamental questions concerning the use of nonparametric methods in reinforcement learning, as well as the de velopment of instance-dependent bounds and algorithms that achieve them. Here we describe these research thrusts briefly, with more detail given in the body of the proposal. For the subclass of ADP/RL problems in which the state and action spaces are discrete, tak ing on finitely many values, our theoretical understanding of algorithms and fundamental limits is fairly complete. However, many RL problems involve state and/or action spaces that are continuous, and apart from certain special cases (e.g., linear quadratic contr ol), there are fewer theoretical guarantees in such settings. Such problems require the flexibility and richness provided by non-par ametric function classes, including kernel-based methods, regression trees, and neural networks, among others. One thrust of the pro posed research is to undertake a systematic development of fundamental theory for non-parametric methods in application to RL proble ms, including the problems of fitted value iteration, fitted policy optimization and Q-learning, as well as off-policy versions of t hese same problems.This research will exploit cutting edge techniques from empirical process theory, concentration of measure, and h igh dimensional statistics in order to obtain sharp upper and lower bounds. The second research thrust is motivated by the empirical finding that many ADP/RL algorithms exhibit a wide variety of behavior across domains and problem instances. Existing theoretical b ounds, which are generally based on worst-case assumptions, fail to capture this variety. In fact, many RL algorithms perform remark ably well in settings in which a worst-case analysis would predict catastrophic failure. A remedy to this issue is to develop a fram ework for instance-specific analysis capable of revealing what aspects of a given problem make it easy or hard. Such theory all ows distinctions to be drawn between ostensibly similar algorithms in terms of their performance profiles. Instance-dependent bounds can also lead to more efficient algorithms, since they allow intermediate computations (such as policy evaluation within a policy o ptimization algorithm) to be terminated early. This is a fundamental research project that is not expected to produce any developmen tal items. Should any developmental items result from this work they will have both civilian and military applications. The intended research is theoretical and will not result in any environmental impacts.

Document Details

Document Type: DoD Grant Award
Publication Date: Aug 20, 2021
Source ID: N000142112842

Entities

People

Martin J. Wainwright

Organizations

Office of Naval Research
United States Navy
University of California Regents

Approximate dynamic programming and reinforcement learning: Non-parametric methods and instance dependence

Abstract

Document Details

Entities

People

Organizations

Tags

Fields of Study

Readers

Technology Areas