On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
Abstract
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.
Keywords: Reinforcement learning, Stochastic approximation, Convergence, Dynamic programming.
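To make the link between Q-learning and stochastic approximation concrete, here is a minimal sketch of tabular Q-learning. It is an illustration only, not code from the report: the two-state environment (`NEXT_STATE`, `REWARD`), the function name `q_learning`, the discount factor, and the step-size schedule are all assumptions chosen for the example. The step size for each state-action pair decays as n^(-0.7) in its visit count n, one common schedule whose sum diverges while its sum of squares stays finite, which is the kind of condition usually imposed in stochastic approximation arguments.

```python
import random

# Hypothetical 2-state, 2-action deterministic environment (not from the report).
# NEXT_STATE[s][a] is the successor state; REWARD[s][a] is the immediate reward.
NEXT_STATE = [[0, 1], [0, 1]]
REWARD = [[0.0, 0.0], [0.0, 1.0]]


def q_learning(steps=20000, gamma=0.5, seed=0):
    """Tabular Q-learning under a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]   # Q-value estimates, indexed q[state][action]
    visits = [[0, 0], [0, 0]]      # visit counts per state-action pair
    state = 0
    for _ in range(steps):
        action = rng.randrange(2)  # explore uniformly so every pair keeps being visited
        nxt = NEXT_STATE[state][action]
        visits[state][action] += 1
        # Assumed step-size schedule: alpha_n = n^(-0.7) for the n-th visit.
        alpha = visits[state][action] ** -0.7
        # Sampled one-step DP backup: reward plus discounted best next value.
        target = REWARD[state][action] + gamma * max(q[nxt])
        q[state][action] += alpha * (target - q[state][action])
        state = nxt
    return q


if __name__ == "__main__":
    for state, row in enumerate(q_learning()):
        print(state, [round(v, 3) for v in row])
```

For this toy environment with gamma = 0.5, solving the Bellman optimality equations by hand gives Q*(0,0) = 0.5, Q*(0,1) = 1, Q*(1,0) = 0.5, and Q*(1,1) = 2, so the estimates returned by the sketch can be compared against those values.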
Document Details
- Document Type: Technical Report
- Publication Date: Aug 06, 1993
- Accession Number: ADA276517
Entities
People
- Michael I. Jordan
- Satinder P. Singh
- Tommi S. Jaakkola
Organizations
- Massachusetts Institute of Technology