Approximating Action-Value Functions: Addressing Issues of Dynamic Range

Abstract

Function approximation is necessary when applying reinforcement learning (RL) to Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs) with very large state spaces. An often-overlooked issue in approximating Q-functions in either framework arises when an action-value update in one state causes a large policy change in other states. Put another way, a small change in the Q-function produces a large change in the implied greedy policy. We call this sensitivity to changes in the Q-function the dynamic range problem and suggest that it may greatly increase the number of training updates required to approximate the optimal policy accurately. We demonstrate that Advantage Learning solves the dynamic range problem in both frameworks and is more robust than some other RL algorithms on these problems. For an MDP, the Advantage Learning algorithm addresses this issue by rescaling the dynamic range of action values within each state by a constant. For SMDPs, the scaling constant can vary for each action.
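The within-state rescaling described in the abstract can be illustrated with a small sketch. This is not the report's algorithm or update rule: it assumes a tabular setting and a simplified rescaling, best + (q - best) / k, and the function names, the constant k, and all numbers are hypothetical. It only shows why stretching the gaps between action values in a state can make the greedy policy less sensitive to small approximation errors.

```python
# Hypothetical illustration: names, constants, and numbers are assumptions,
# not taken from the report.

def greedy(values):
    """Return the index of the largest action value."""
    return max(range(len(values)), key=lambda a: values[a])

def rescale_within_state(q_values, k):
    """Stretch the gaps between action values in one state by 1/k.

    The best action keeps its value; every other action is pushed
    further below it when 0 < k < 1.
    """
    best = max(q_values)
    return [best + (q - best) / k for q in q_values]

# Action values for two states: the gap between actions within a state (0.01)
# is tiny compared with the difference between states (about 10).
q = {"s0": [10.00, 10.01], "s1": [20.00, 19.99]}

# A small error of the kind a function approximator might introduce.
error = [0.02, -0.02]

k = 0.01  # assumed scaling constant
for state, values in q.items():
    stretched = rescale_within_state(values, k)
    noisy_q = [v + e for v, e in zip(values, error)]
    noisy_stretched = [v + e for v, e in zip(stretched, error)]
    # Print the greedy action for: exact values, noisy values, noisy rescaled values.
    print(state, greedy(values), greedy(noisy_q), greedy(noisy_stretched))
```

In state s0 the error is larger than the gap between the two raw action values, so the greedy action flips; after rescaling, the gap is about 1.0 and the same error leaves the greedy action unchanged.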


Document Details

Document Type: Technical Report
Publication Date: Dec 17, 1998
Accession Number: ADA374186

Entities

People

Mance E. Harmon

Organizations

University of Massachusetts Amherst
