Procedural Task Learning from Goal-Oriented Activity Data

Abstract

Many everyday human tasks are procedural, i.e., they consist of a sequence of subtasks that must be followed to achieve a certain goal. Automatic learning of such procedural tasks from visual data has a fundamental impact on the advancement of artificial intelligence, with important applications such as teaching intelligent agents to perform complex tasks and evaluating their performance, constructing large instructional knowledge bases for workforce education, and early detection of malicious activities. Despite recent advances, the emerging problem of automatic procedure learning remains largely unsolved because of major challenges in converting raw, uncurated videos into fine-grained instructions. These include large amounts of irrelevant background activity in videos, large variations in how a task is performed, the high cost and complexity of temporally annotating videos from many tasks, and the difficulty of generalizing to unseen (sub)tasks. In this project, we develop a comprehensive mathematical framework that overcomes these challenges by learning fine-grained instructions from long, uncurated procedural videos using no or minimal supervision. In Aim 1, we develop unsupervised methods that discover subtasks in videos while handling large background actions and appearance variations of subtasks. To do so, we model subtasks by low-dimensional manifolds and propose a new class of weak sequence alignment methods that simultaneously learn manifolds and find associations between manifold sequences of uncurated videos. We generalize our framework to align noisy narrations with videos for more effective discovery of subtasks. We also develop clustering methods that separate videos into tasks and variations of a task and hence provide inputs to our alignment technique. In Aim 2, we address grounding of videos using noisy but extremely cheap weak supervision (e.g., titles, tags and transcripts).
We propose a probabilistic method that leverages all forms of weak supervision in a unified framework to recognize tasks, predict sequences of subtasks and localize them in videos. We propose efficient methods to integrate our unsupervised methods in Aim 1 with the weakly-supervised method for more effective learning. We also extend our method to anticipate and generate future subtasks given observations from current steps of a task. In Aim 3, we address learning of rare and previously unseen (sub)tasks. We propose a compositional learning framework in which we model each subtask as an interaction between elementary components (e.g., actions and objects) and combine the learned models of interaction components to transfer knowledge from seen to unseen interactions. To enable recognition and generation of fine-grained instructions, we extend our compositional method to handle higher-order complex interactions, with more components and simultaneously occurring interactions. The outcomes of the project will enable the design of intelligent agents that perform, or help humans perform, various complex tasks with minimal intervention. The results of the project will also enable building assistive robots and technologies that help soldiers, veterans, the elderly and patients carry out mission-related and daily activities. The project will benefit context-driven decision making and proactive decision support systems, which are fundamental to the missions of the Army. All code and data will be released to the public, and findings will be disseminated via tutorials, workshops and surveys as well as publications in top AI venues.

Document Details

Document Type: DoD Grant Award
Publication Date: Jun 25, 2021
Source ID: W911NF2110276

Entities

People

Ehsan Elhamifar

Organizations

Army Contracting Command
Northeastern University
United States Army
