Importance sampling in reinforcement learning with an estimated behavior policy

Abstract

In reinforcement learning, importance sampling is a widely used method for estimating an expectation under the data distribution of one policy when the data were in fact generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimates under the observed data. We show that this general technique reduces variance due to sampling error in Monte Carlo-style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the reinforcement learning literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.
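The idea in the abstract can be illustrated with a minimal sketch. It is not the estimators introduced in the article. It assumes a toy single-state problem with three actions and deterministic rewards, and it estimates the behavior policy by simple action counts (the maximum likelihood estimate in this tabular case). All function names, policies, and reward values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def ordinary_is(actions, rewards, pi_e, pi_b):
    """Importance sampling using the true behavior policy probabilities."""
    n = len(actions)
    return sum(pi_e[a] / pi_b[a] * r for a, r in zip(actions, rewards)) / n


def estimated_is(actions, rewards, pi_e):
    """Importance sampling using a count-based maximum likelihood
    estimate of the behavior policy computed from the same data."""
    n = len(actions)
    counts = Counter(actions)
    # MLE of the behavior policy: empirical action frequencies.
    pi_hat = {a: c / n for a, c in counts.items()}
    return sum(pi_e[a] / pi_hat[a] * r for a, r in zip(actions, rewards)) / n


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def run_trials(num_trials=2000, n=50, seed=0):
    """Compare the variance of both estimators over repeated data sets."""
    rng = random.Random(seed)
    pi_b = [0.5, 0.3, 0.2]    # assumed behavior policy (illustrative)
    pi_e = [0.2, 0.3, 0.5]    # assumed target policy (illustrative)
    reward = [1.0, 2.0, 4.0]  # assumed deterministic reward per action
    ordinary, estimated = [], []
    for _ in range(num_trials):
        actions = rng.choices(range(3), weights=pi_b, k=n)
        rewards = [reward[a] for a in actions]
        ordinary.append(ordinary_is(actions, rewards, pi_e, pi_b))
        estimated.append(estimated_is(actions, rewards, pi_e))
    return variance(ordinary), variance(estimated)


if __name__ == "__main__":
    var_ordinary, var_estimated = run_trials()
    print("variance with true behavior policy:     ", var_ordinary)
    print("variance with estimated behavior policy:", var_estimated)
```

In this toy setting the estimated weights correct for the gap between how often each action was actually sampled and how often the behavior policy would sample it in expectation, which is the sampling error the abstract refers to. Sequential problems and stochastic rewards require the fuller treatment given in the article.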

Document Details

Document Type
Pub Defense Publication
Publication Date
May 07, 2021
Source ID
10.1007/s10994-020-05938-9

Entities

People

  • Josiah P. Hanna
  • Peter Stone
  • Scott Niekum

Organizations

  • Army Research Office
  • Defense Advanced Research Projects Agency
  • General Motors
  • Lockheed Martin
  • National Science Foundation Directorate of Computer and Information Science and Engineering
  • Office of Naval Research
  • Robert Bosch LLC

Tags

Fields of Study

  • Mathematics

Readers

  • Neural Network Machine Learning
  • Statistical inference

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms