Learning to Control Renewal Processes with Bandit Feedback

Abstract

We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent from each other, and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. In addition to the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new one: should the controller complete the task and get the reward, or interrupt the task for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interruption mechanism improves the reward linearly over time. Applications of this model include server scheduling, optimal free sampling strategies in advertising and adaptive content selection. From a learning perspective, the interrupt mechanism necessitates learning the whole arm distribution from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI based on median-of-means estimator for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(Kłog(τ)+KL) regret. We also prove that the regret under any admissible policy is Ømega(Kłog(τ)), which implies that UCB-BwI is order optimal.

Document Details

Document Type: Pub Defense Publication
Publication Date: Jun 19, 2019
Source ID: 10.1145/3341617.3326158

Entities

People

Atilla Eryilmaz
R. Srikant
Semih Cayci

Organizations

Defense Threat Reduction Agency
National Science Foundation
Ohio State University
University of Illinois Urbana–Champaign

Learning to Control Renewal Processes with Bandit Feedback

Abstract

Document Details

Entities

People

Organizations

Tags

Fields of Study

Readers