EXTENSIONS OF THE TWO-ARMED BANDIT AND RELATED PROCESSES WITH ON-LINE EXPERIMENTATION.

Abstract

Sequential decision problems are considered in which immediate payoffs are random with unknown means. Using prior knowledge, each decision is based on the prior expected payoff and the future value of information associated with the observed payoff. A fundamental theory is developed and used to extend the two-armed bandit model to multiple arms and setup costs or bonuses, assuming only two states of nature. Conjugate prior densities lead to a 'stay-on-the-winner' rule for bounded variables. A least-squares policy iteration method is developed for computation. Bounds on the optimal return function are derived for general stochastic dynamic programming problems. (Author)
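As a rough illustration of the decision rule the abstract describes (prior expected payoff plus the future value of information), the following minimal sketch computes a finite-horizon Bayes-optimal policy by backward recursion for a bandit with two states of nature. It is not taken from the report. The Bernoulli payoffs, the state labels A and B, and the names `posterior`, `value`, and `arms` are all assumptions made for the example.

```python
def posterior(pi, p_a, p_b, success):
    """Bayes update of P(state A) after one observed Bernoulli payoff.

    pi  : prior probability that state A holds
    p_a : success probability of the pulled arm under state A (assumed known)
    p_b : success probability of the pulled arm under state B (assumed known)
    """
    like_a = p_a if success else 1.0 - p_a
    like_b = p_b if success else 1.0 - p_b
    return pi * like_a / (pi * like_a + (1.0 - pi) * like_b)


def value(pi, n, arms):
    """Return (optimal expected total payoff, best arm index) with n pulls left.

    arms[k] = (success prob under state A, success prob under state B).
    Plain recursion without memoization, so keep n small (cost grows
    like (2 * len(arms)) ** n).
    """
    if n == 0:
        return 0.0, None
    best_val, best_arm = float("-inf"), None
    for k, (p_a, p_b) in enumerate(arms):
        p_succ = pi * p_a + (1.0 - pi) * p_b  # prior expected payoff of arm k
        total = p_succ
        # Add the expected future value under each possible observation.
        for success, prob in ((True, p_succ), (False, 1.0 - p_succ)):
            if prob > 0.0:
                nxt = posterior(pi, p_a, p_b, success)
                total += prob * value(nxt, n - 1, arms)[0]
        if total > best_val:
            best_val, best_arm = total, k
    return best_val, best_arm


# Hypothetical example: two arms whose success rates swap between states.
arms = [(0.7, 0.3), (0.3, 0.7)]
v, arm = value(0.6, 3, arms)
print(f"value with 3 pulls left: {v:.4f}, first arm to pull: {arm}")
```

Because `arms` is a list, the same recursion covers more than two arms. Setup costs or bonuses and conjugate priors over continuous parameters, both mentioned in the abstract, are outside this sketch.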

Document Details

Document Type
Technical Report
Publication Date
Nov 15, 1965
Accession Number
AD0623884

Entities

People

  • Kent Quisel

Organizations

  • Stanford University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Applied Mathematics
  • Computations
  • Computer Programming
  • Computing-Related Activities
  • Dynamic Programming
  • Interdisciplinary Science
  • Iterations
  • Mathematical Analysis
  • Mathematical Programming
  • Mathematics
  • Numerical Analysis
