Rapid Adaptation and Improvisation with RL and Robotic Foundation Models
Abstract
Language models such as ChatGPT and their multi-modal variants, vision-language models, represent a compelling class of models that can not only answer questions or recognize objects in images, but also engage in complex problem solving and natural language interaction. However, endowing embodied robotic systems with such capabilities in a way that allows for flexible improvisation of novel solutions to new problems requires situating such methods in the context of decision making and control. What would it take to bring such capabilities to robotic systems? While there are many ways to apply generic foundation models (e.g., vision-language models) to robotic perception and natural language interpretation challenges, true robotic foundation models must be adapted to the capabilities and context of the robotic system. This requires large amounts of data, a suitable choice of supervision signal, and a suitable model architecture. In this project, we will study this question through the lens of lifelong learning, where the robot starts with a standard (non-robot-adapted) foundation model and, over the course of a continuous self-supervised deployment, adapts this model to the unique context and capabilities of the robotic system. In this framework, the generalization abilities of the foundation model trained on Internet-scale data serve as a kind of "common sense prior" that enables the robot to behave reasonably (but suboptimally) in novel situations, interpret feedback and commands from humans, and extract a supervision signal from the environment (e.g., by classifying and interpreting outcomes of its actions), while the continual self-supervised adaptation process gradually finetunes the model to control the robot more optimally, learning on the job and becoming better the more it is used. The goal is to enable robots to rapidly improvise solutions to new problems and learn from these improvised solutions to continually improve during real-world deployment.
The role of foundation models in this recipe is to provide rich prior semantic knowledge, which can shortcut the otherwise exceedingly long training times required by conventional reinforcement learning methods, enable robots to interpret natural language feedback from humans, and provide a basis of semantic "common sense" that offers reasonable starting points for solving novel physical challenges.
For public release
Document Details
- Document Type: DoD Grant Award
- Publication Date: Dec 14, 2024
- Source ID: N000142512060
Entities
People
- Sergey Levine
Organizations
- Office of Naval Research
- United States Navy
- University of California Regents