Watch and Practice to Improve: Learning to Act by Watching Human Videos

Abstract

Building general-purpose robotic agents for complex visual environments remains a challenge. Current approaches often rely on engineered setups or heavy supervision, neither of which scales. Deploying inexperienced robots in the world poses a chicken-and-egg problem for robot learning: to collect experience safely, a robot must be deployed, but to be deployed it first needs data to train on. We propose to get around this issue by using the large and diverse body of human video data available on the internet to bootstrap robot learning. However, using human data presents challenges, such as understanding agency, inferring physical forces, and addressing the human-robot embodiment gap. To address these problems, we propose a general-purpose framework built on the key observation that, in order to build general-purpose agents, the focus has to be on both the visual and the motor aspects of human videos. This decoupling of agent from environment allows us to build a roadmap for generalizable robots. We aim to learn visual, or environment-level, generalization by building object, affordance, and 3D-aware dynamics models that efficiently predict the consequences of an agent's actions. We also propose explicit and implicit ways to build joint human-robot action representations, allowing for efficient bootstrapping of action and control policies. While action and environment priors offer good initializations, mastering tasks also requires practicing in the real world; watching alone is not enough to learn grounded skills. We outline a framework that addresses errors in the previously learned models, occlusions, and the lack of force sensing, and we propose approaches to help robots learn new skills and improve beyond the priors learned from humans. We will evaluate our proposed work in home scenarios, where the robot will have to perform complex, long-horizon manipulation and navigation tasks.

Document Details

Document Type: DoD Grant Award
Publication Date: Mar 14, 2024
Source ID: FA95502310747

Entities

People

Deepak Pathak

Organizations

Air Force Office of Scientific Research
Carnegie Mellon University
United States Air Force
