Reinforcement Learning Through Gradient Descent

Abstract

Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables, and then applied to function approximators by using backpropagation. This can lead to algorithms diverging on very small, simple MDPs and Markov chains, even with linear function approximators and epoch wise training. These algorithms are also very difficult to analyze, and difficult to combine with other algorithms. A series of new families of algorithms are derived based on stochastic gradient descent. Since they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even on general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Q learning or SARSA or value iteration, and rederive a new form of it with provable convergence. In addition to better convergence properties, it is shown how gradient descent allows an inelegant, inconvenient algorithm like Advantage updating to be converted into a much simpler and more easily analyzed algorithm like Advantage learning. hi this case that is very useful, since Advantages can be learned thousands of times faster than Q values for continuous time problems. In this case, there are significant practical benefits of using gradient descent based techniques. In addition to improving both the theory and practice of existing types of algorithms, the gradient descent approach makes it possible to create entirely new classes of reinforcement learning algorithms. VAPS algorithms can be derived that ignore values altogether, and simply learn good policies directly. One hallmark of gradient descent is the ease with which different algorithms can be combined, and this is a prime example.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
May 14, 1999
Accession Number
ADA366228

Entities

People

  • Leemon C. Baird Iii

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Air Platforms
  • Autonomy
  • Energy and Power Technologies
  • Weapons Technologies

DTIC Thesaurus Topics

  • Air Force
  • Algorithms
  • Artificial Intelligence
  • Automata Theory
  • Computational Science
  • Computer Science
  • Computers
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Mathematical Models
  • Neural Networks
  • Probability Distributions
  • Random Variables
  • Reinforcement Learning
  • Signal Processing

Fields of Study

  • Computer science

Readers

  • Calculus or Mathematical Analysis
  • Neural Network Machine Learning.
  • Operations Research

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks