A Response to Bertsekas' A Counterexample to Temporal-Differences Learning.

Abstract

For an absorbing Markov chain with a reinforcement on each transition, Bertsekas (1995a) gives a simple example in which the function learned by TD(lambda) depends on lambda. Bertsekas showed that for lambda=1 the approximation is optimal with respect to a least-squares error in the value function, and that for lambda=0 the approximation obtained by the TD method is poor with respect to the same metric. With respect to the error in the values, TD(1) approximates the function better than TD(0). However, with respect to the error in the differences in the values, TD(0) approximates the function better than TD(1), so TD(1) is better than TD(0) only under the former metric. In addition, direct TD(lambda) weights the errors unequally, while residual gradient methods (Baird, 1995; Harmon, Baird, & Klopf, 1995) weight the errors equally. For the case of control, a simple Markov decision process is presented for which direct TD(0) and residual gradient TD(0) both learn the optimal policy, while TD(1) learns a suboptimal policy. These results suggest that, for this example, the differences in state values are more significant than the state values themselves, so TD(0) is preferable to TD(1).
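
The distinction between the two metrics in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `value_error` and `difference_error`, the four-state chain, and all the numbers are hypothetical and are not taken from Bertsekas' example or from the report. The sketch only shows that one approximation can be closer in state values while another is closer in the differences between successive state values.

```python
def value_error(true_values, approx_values):
    """Sum of squared errors in the state values themselves."""
    return sum((v - a) ** 2 for v, a in zip(true_values, approx_values))


def difference_error(true_values, approx_values):
    """Sum of squared errors in the differences between successive state values."""
    true_diffs = [b - a for a, b in zip(true_values, true_values[1:])]
    approx_diffs = [b - a for a, b in zip(approx_values, approx_values[1:])]
    return sum((t - p) ** 2 for t, p in zip(true_diffs, approx_diffs))


# Hypothetical values for four successive states (illustrative only).
true_values = [0.0, 1.0, 2.0, 3.0]

# Approximation A: close to the true values, but with the wrong shape.
approx_a = [0.5, 0.5, 2.5, 2.5]

# Approximation B: offset by a constant, so every difference is exact.
approx_b = [1.0, 2.0, 3.0, 4.0]

print(value_error(true_values, approx_a), value_error(true_values, approx_b))
print(difference_error(true_values, approx_a), difference_error(true_values, approx_b))
```

In this made-up case, approximation A wins on the value metric and approximation B wins on the difference metric. When actions are chosen by comparing the values of successor states, only the differences matter, which is the intuition behind the abstract's claim about control.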

Document Details

Document Type: Technical Report
Publication Date: Nov 22, 1996
Accession Number: ADA321555

Entities

People

Leemon C. Baird
Mance E. Harmon

Organizations

Wright Laboratory
