A Response to Bertsekas' "A Counterexample to Temporal-Differences Learning."
Abstract
For an absorbing Markov chain with a reinforcement on each transition, Bertsekas (1995a) gives a simple example where the function learned by TD(lambda) depends on lambda. Bertsekas showed that for lambda=1 the approximation is optimal with respect to a least-squares error of the value function, and that for lambda=0 the approximation obtained by the TD method is poor with respect to the same metric. With respect to the error in the values, TD(1) approximates the function better than TD(0). However, with respect to the error in the differences in the values, TD(0) approximates the function better than TD(1). TD(1) is therefore better than TD(0) only with respect to the former metric, not the latter. In addition, direct TD(lambda) weights the errors unequally, while residual gradient methods (Baird, 1995; Harmon, Baird, & Klopf, 1995) weight the errors equally. For the case of control, a simple Markov decision process is presented for which direct TD(0) and residual gradient TD(0) both learn the optimal policy, while TD(1) learns a suboptimal policy. These results suggest that, for this example, the differences in state values are more significant than the state values themselves, so TD(0) is preferable to TD(1).
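The dependence of the learned function on the method can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the report: the four-state deterministic chain, the reward values in `REWARDS`, the one-weight linear approximator V(i) = w * i, the undiscounted setting, and all function names. The weights are computed in closed form rather than by simulation, and the numbers are not the report's results.

```python
# Hypothetical deterministic chain: state i moves to i-1, and state 0 is terminal.
# REWARDS[i] is the reinforcement on the transition i -> i-1 (assumed values).
REWARDS = {1: 1.0, 2: 1.0, 3: 1.0, 4: -3.0}


def true_values(rewards):
    """Undiscounted value of state i: the sum of rewards on the path i -> 0."""
    values, total = {}, 0.0
    for i in sorted(rewards):
        total += rewards[i]
        values[i] = total
    return values


def td1_weight(rewards):
    """Least-squares fit of w*i to the true values (the limit assumed for TD(1)
    when every state is visited equally often)."""
    v = true_values(rewards)
    return sum(i * v[i] for i in v) / sum(i * i for i in v)


def td0_weight(rewards):
    """Weight at which the summed direct TD(0) update is zero.
    The update for state i is proportional to i * (r_i + w*(i-1) - w*i),
    which simplifies to i * (r_i - w): errors are weighted by the feature i."""
    return sum(i * r for i, r in rewards.items()) / sum(i for i in rewards)


def residual_gradient_weight(rewards):
    """Minimizer of the unweighted sum of (r_i + w*(i-1) - w*i)^2,
    which simplifies to the sum of (r_i - w)^2, so w is the mean reward."""
    return sum(rewards.values()) / len(rewards)


def value_sse(w, rewards):
    """Sum of squared errors in the state values."""
    v = true_values(rewards)
    return sum((w * i - v[i]) ** 2 for i in v)


def difference_sse(w, rewards):
    """Unweighted sum of squared errors in the value differences
    V(i) - V(i-1), whose true value is r_i."""
    return sum((r - w) ** 2 for r in rewards.values())


if __name__ == "__main__":
    for name, solve in [("TD(1)", td1_weight),
                        ("direct TD(0)", td0_weight),
                        ("residual gradient TD(0)", residual_gradient_weight)]:
        w = solve(REWARDS)
        print(f"{name}: w={w:.4f} "
              f"value_sse={value_sse(w, REWARDS):.4f} "
              f"difference_sse={difference_sse(w, REWARDS):.4f}")
```

In this sketch the three methods settle on three different weights. The TD(1) weight minimizes the error in the values, the residual gradient weight minimizes the equally weighted error in the differences, and the direct TD(0) weight zeroes a sum of difference errors in which each state's error is scaled by its feature value.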
Document Details
- Document Type: Technical Report
- Publication Date: Nov 22, 1996
- Accession Number: ADA321555
Entities
People
- Leemon C. Baird
- Mance E. Harmon
Organizations
- Wright Laboratory