EXTENSIONS OF THE TWO-ARMED BANDIT AND RELATED PROCESSES WITH ON-LINE EXPERIMENTATION.
Abstract
Sequential decision problems are considered in which immediate payoffs are random with unknown means. Using prior knowledge, each decision is based on the prior expected payoff and the future value of information associated with the observed payoff. A fundamental theory is developed and used to extend the two-armed bandit model to multiple arms and setup costs or bonuses, assuming only two states of nature. Conjugate prior densities lead to a 'stay-on-the-winner' rule for bounded variables. A least-squares policy iteration method is developed for computation. Bounds on the optimal return function are derived for general stochastic dynamic programming problems. (Author)
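As a rough illustration of the 'stay-on-the-winner' idea mentioned in the abstract, the following minimal Python sketch simulates Bernoulli arms with Beta conjugate priors. It is not taken from the report. The function name `stay_on_the_winner`, the uniform Beta(1, 1) prior, the restriction to 0/1 payoffs, and the choice to move to the next arm after a failure are all assumptions made for this example; the abstract states only that conjugate priors lead to a rule of this type for bounded variables.

```python
import random


def stay_on_the_winner(success_probs, horizon, seed=0):
    """Simulate a stay-on-the-winner rule on Bernoulli arms (illustrative only).

    Assumptions (not from the report): payoffs are 0/1, each arm starts
    with a Beta(1, 1) prior, and a failure moves play to the next arm.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    # posterior[i] = [alpha, beta] of the Beta posterior for arm i
    posterior = [[1, 1] for _ in range(n_arms)]
    arm = 0
    total = 0
    for _ in range(horizon):
        reward = 1 if rng.random() < success_probs[arm] else 0
        # Conjugate update: a success increments alpha, a failure increments beta
        posterior[arm][0] += reward
        posterior[arm][1] += 1 - reward
        total += reward
        if reward == 0:
            # Stay after a success; switch only after a failure
            arm = (arm + 1) % n_arms
    # Posterior mean of each arm's success probability
    means = [a / (a + b) for a, b in posterior]
    return total, posterior, means
```

For example, `stay_on_the_winner([0.0, 1.0], 10)` fails once on the first arm, switches, and then stays on the second arm for the remaining nine plays, giving a total payoff of 9.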
Document Details
- Document Type: Technical Report
- Publication Date: Nov 15, 1965
- Accession Number: AD0623884
Entities
People
- Kent Quisel
Organizations
- Stanford University