A Response to Bertsekas' "A Counterexample to Temporal-Differences Learning"

Abstract

For an absorbing Markov chain with a reinforcement on each transition, Bertsekas (1995a) gives a simple example where the function learned by TD(lambda) depends on lambda. Bertsekas showed that for lambda=1 the approximation is optimal with respect to a least-squares error of the value function, and that for lambda=0 the approximation obtained by the TD method is poor with respect to the same metric. With respect to the error in the values, TD(1) approximates the function better than TD(0). However, with respect to the error in the differences in the values, TD(0) approximates the function better than TD(1). TD(1) is therefore better than TD(0) only under the former metric, not the latter. In addition, direct TD(lambda) weights the errors unequally, while residual gradient methods (Baird, 1995; Harmon, Baird, & Klopf, 1995) weight the errors equally. For the case of control, a simple Markov decision process is presented for which direct TD(0) and residual gradient TD(0) both learn the optimal policy, while TD(1) learns a suboptimal policy. These results suggest that, for this example, the differences in state values are more significant than the state values themselves, so TD(0) is preferable to TD(1).
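
The two error metrics and the equal versus unequal weighting mentioned in the abstract can be made concrete with a minimal sketch. Everything specific below is an assumption chosen for illustration and is not taken from the report: a deterministic absorbing chain n -> n-1 -> ... -> 1 -> 0, a cost of 1 on every transition except the first, a one-parameter approximator V(i) = r*i, and all function names. Under these assumptions each method has a closed-form solution, so no iterative learning is simulated.

```python
# Hypothetical deterministic absorbing chain: n -> n-1 -> ... -> 1 -> 0 (terminal).
# costs[i-1] is the cost of the transition out of state i.
# Approximator (an assumption for this sketch): V(i) = r * i, one parameter r.


def true_values(costs):
    """J(i) = total cost accumulated from state i down to the terminal state."""
    values, total = [], 0.0
    for g in costs:
        total += g
        values.append(total)
    return values


def fit_td1(costs):
    """Least-squares fit of r*i to J(i), the solution TD(1) converges to."""
    n = len(costs)
    J = true_values(costs)
    num = sum(i * J[i - 1] for i in range(1, n + 1))
    den = sum(i * i for i in range(1, n + 1))
    return num / den


def fit_td0(costs):
    """Fixed point of direct TD(0): sum_i i * (g_i + r*(i-1) - r*i) = 0."""
    n = len(costs)
    return sum(i * costs[i - 1] for i in range(1, n + 1)) / sum(range(1, n + 1))


def fit_residual_gradient(costs):
    """Minimizer of the equally weighted squared residual sum_i (g_i - r)^2."""
    return sum(costs) / len(costs)


def value_error(r, costs):
    """Squared error in the state values."""
    J = true_values(costs)
    return sum((r * i - J[i - 1]) ** 2 for i in range(1, len(costs) + 1))


def difference_error(r, costs, weights=None):
    """Squared error in successive value differences.

    The true difference J(i) - J(i-1) is g_i; the approximate one is r.
    """
    weights = weights or [1.0] * len(costs)
    return sum(w * (r - g) ** 2 for w, g in zip(weights, costs))


n = 10
costs = [1.0] * (n - 1) + [-(n - 1.0)]      # assumed costs, illustration only
feature_weights = [float(i) for i in range(1, n + 1)]

r_td1 = fit_td1(costs)
r_td0 = fit_td0(costs)
r_rg = fit_residual_gradient(costs)

for name, r in [("TD(1)", r_td1), ("TD(0)", r_td0), ("residual gradient", r_rg)]:
    print(name, r, value_error(r, costs), difference_error(r, costs),
          difference_error(r, costs, feature_weights))
```

In this sketch the TD(1) solution minimizes the error in the values, the residual gradient solution minimizes the equally weighted error in the differences, and the direct TD(0) solution minimizes the error in the differences weighted by the feature value i. Which solution looks best therefore depends on the metric and on the weighting, and the numbers printed here say nothing about the specific example analyzed in the report.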

Document Details

Document Type
Technical Report
Publication Date
Nov 22, 1996
Accession Number
ADA321555

Entities

People

  • Leemon C. Baird
  • Mance E. Harmon

Organizations

  • Wright Laboratory

Tags

Communities of Interest

  • Autonomy
  • C4I

DTIC Thesaurus Topics

  • Air Force
  • Air Force Facilities
  • Algorithms
  • Computer Science
  • Dynamic Programming
  • Equations
  • Errors
  • Frequency
  • Governments
  • Learning
  • Machine Learning
  • Markov Chains
  • Reinforcement Learning
  • Residuals
  • Transitions
  • United States
  • United States Government

Readers

  • Approximation Theory
  • Astronomy and Astrophysics
  • Operations Research