Approximating Action-Value Functions: Addressing Issues of Dynamic Range

Abstract

Function approximation is necessary when applying reinforcement learning (RL) to Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs) with very large state spaces. An often-overlooked issue in approximating Q-functions in either framework arises when an action-value update in one state causes a large policy change in other states. Put another way, a small change in the Q-function produces a large change in the implied greedy policy. We call this sensitivity to changes in the Q-function the dynamic range problem and suggest that it may greatly increase the number of training updates required to approximate the optimal policy accurately. We demonstrate that Advantage Learning solves the dynamic range problem in both frameworks and is more robust than some other RL algorithms on these problems. For an MDP, the Advantage Learning algorithm addresses this issue by rescaling the dynamic range of action values within each state by a constant. For SMDPs, the scaling constant can vary for each action.
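
The rescaling idea in the abstract can be illustrated with a minimal sketch. It assumes the relation commonly given for Advantage Learning in the wider literature, in which the best action in a state keeps its value and every other action's gap to the best is divided by a positive scale; this record does not state the report's own equations. The function name `rescale_action_values` and the sample numbers are hypothetical.

```python
from typing import List, Sequence, Union


def rescale_action_values(
    q_values: Sequence[float],
    scale: Union[float, Sequence[float]],
) -> List[float]:
    """Stretch the gaps between the best action value and the others.

    Assumed relation (not taken from this record):
        A(x, u) = V(x) + (Q(x, u) - V(x)) / scale_u,  V(x) = max_u Q(x, u)

    A single float models the MDP case (one constant per state);
    a sequence models the SMDP case (one constant per action).
    """
    if isinstance(scale, (int, float)):
        scales = [float(scale)] * len(q_values)
    else:
        scales = [float(s) for s in scale]
    if len(scales) != len(q_values):
        raise ValueError("need one scale per action")
    if any(s <= 0 for s in scales):
        raise ValueError("scales must be positive")

    best = max(q_values)
    # The best action has a zero gap, so its value is unchanged;
    # every other action moves further below it when scale < 1.
    return [best + (q - best) / s for q, s in zip(q_values, scales)]


if __name__ == "__main__":
    q = [10.0, 9.9, 9.8]  # nearly indistinguishable action values
    print(rescale_action_values(q, 0.1))              # MDP: one constant
    print(rescale_action_values(q, [0.1, 0.2, 0.5]))  # SMDP: per-action
```

With a scale below 1, gaps that were small relative to the overall magnitude of the values become larger while the greedy action stays the same, so a small approximation error is less likely to flip the implied policy.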

Document Details

Document Type: Technical Report
Publication Date: Dec 17, 1998
Accession Number: ADA374186

Entities

People

  • Mance E. Harmon

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Human Systems
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Air Force
  • Algorithms
  • Artificial Intelligence
  • Computer Science
  • Dynamic Range
  • Information Processing
  • Information Systems
  • Learning
  • Machine Learning
  • Neural Networks
  • Reinforcement Learning
  • Sensitivity
  • Supervised Machine Learning
  • Technical Information Centers
  • Time Intervals
  • Training

Fields of Study

  • Computer science

Readers

  • Adaptive Control and Estimation with Uncertainty in Dynamic Systems
  • Calculus or Mathematical Analysis
  • Statistical Inference

Technology Areas

  • Space