Topic 5.2.1: Action Co-Discovery as a Cross-Reconstruction Problem

Abstract

This proposal describes a principled formulation for automatically discovering actions that are common to a given set of videos. These common actions, which we call coactions, are represented either as a set of frames or, in more detail, as a set of space-time segmentations. The formulation is based on the idea that each video's role in a coaction is measured by how well that video can be used to reconstruct the other videos participating in the coaction. We explicitly exclude a video's own features from the basis that represents it, so that the basis is not overwhelmed by the background rather than the action itself; hence we call this a cross-reconstruction problem. Nor does the formulation require a common or joint representation over all videos, which is the de facto approach in co-detection and co-segmentation methods; in video in particular, it is not clear that the underlying action invariants are descriptive enough to specify such a joint model. To our knowledge, this is the first work to propose joint spatiotemporal discovery of common actions across multiple videos with no supervision, and the first such work in vision to relax the assumption that only one action occurs at any given time in a video. The proposed work builds on our earlier work on temporal discovery of common actions, which was the first work on the problem when it appeared (2012). It will also draw on our broad experience in video understanding and, more specifically, on our widely adopted video segmentation work, and it will use a multi-actor-action dataset that we recently constructed as the basic experimental platform. The main scientific objective is to formulate, solve, and study the new cross-reconstruction problem.
We plan to do so in the context of visual actions in ground-level video sequences, but the basic formulation is applicable to various kinds of space-time data. Our primary proposed methodological direction is joint sparse coding across multiple videos: the approach will jointly learn the bases for cross-reconstruction while determining which elements (frames, trajectories, or segments) of each video are actually part of the common action. We will apply common action co-discovery to both synthetic and real data, emphasizing plausible abstractions of Army-relevant scenarios. We will further generalize the method from co-discovery of a single coaction to multiple coactions in a given video, and we will scale the method to process hundreds of videos concurrently. The proposed formulation has significant Army-relevant implications for future automatic learning of visual phenomena with little or no specific human supervision, and for situations where large labeled sample sets are simply not feasible. We give an example FOB-protection CONOP in the full proposal, but the method has the potential to generalize to many scenarios and to significantly reduce the human effort required for low-level video data processing. Furthermore, the general framework of cross-reconstruction is applicable beyond videos and actions to other Army-relevant applications such as robot path planning. Our ongoing collaboration with members of ARL CISD in Adelphi, MD, which has already resulted in numerous co-authored publications and successful field tests, will ensure that our research directions remain applicable to the Army mission.
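
The cross-reconstruction idea described in the abstract can be illustrated with a small sketch. It is not the proposal's method: the abstract names joint sparse coding with learned bases, whereas this example substitutes plain ridge-regularized least squares and uses the other videos' raw frame features directly as the basis. It assumes per-frame feature vectors have already been extracted, and the names cross_reconstruction_errors and make_synthetic_videos are hypothetical. Frames that the other videos reconstruct well are candidate coaction frames; frames they reconstruct poorly are likely video-specific background.

```python
import numpy as np


def cross_reconstruction_errors(videos, ridge=1e-3):
    """Relative per-frame error when each video is rebuilt from the OTHER videos.

    videos: list of (d, n_i) arrays, one column per frame feature vector.
    Returns a list of (n_i,) arrays; a low value means the frame is well
    explained by the other videos (a candidate coaction frame).
    Assumption: ridge least squares stands in for sparse coding.
    """
    errors = []
    for i, target in enumerate(videos):
        # The basis deliberately excludes the target video's own features.
        basis = np.hstack([v for j, v in enumerate(videos) if j != i])
        gram = basis.T @ basis + ridge * np.eye(basis.shape[1])
        coeffs = np.linalg.solve(gram, basis.T @ target)
        residual = target - basis @ coeffs
        norms = np.linalg.norm(target, axis=0) + 1e-12
        errors.append(np.linalg.norm(residual, axis=0) / norms)
    return errors


def make_synthetic_videos(seed=0, n_videos=3, dim=20, rank=3, n_frames=10):
    """Toy data: one shared 'action' subspace plus a private background
    subspace per video, all mutually orthogonal (an idealized assumption)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    action = q[:, :rank]
    videos, masks = [], []
    for k in range(n_videos):
        start = rank * (k + 1)
        background = q[:, start:start + rank]
        action_frames = action @ rng.standard_normal((rank, n_frames))
        background_frames = background @ rng.standard_normal((rank, n_frames))
        videos.append(np.hstack([action_frames, background_frames]))
        masks.append(np.arange(2 * n_frames) < n_frames)  # True = action frame
    return videos, masks


if __name__ == "__main__":
    videos, masks = make_synthetic_videos()
    for k, (err, mask) in enumerate(zip(cross_reconstruction_errors(videos), masks)):
        print(f"video {k}: mean error on action frames {err[mask].mean():.4f}, "
              f"on background frames {err[~mask].mean():.4f}")
```

Because the toy subspaces are constructed to be exactly orthogonal, the separation between action and background frames is far cleaner here than it would be on real video features; the sketch only shows why leaving a video's own features out of its basis keeps background content from being trivially reconstructed.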

Document Details

Document Type
DoD Grant Award
Publication Date
Oct 17, 2018
Source ID
W911NF1510354

Entities

People

  • Jason J. Corso

Organizations

  • Army Contracting Command
  • United States Army
  • University of Michigan

Tags

Fields of Study

  • Computer science

Readers

  • Artificial Intelligence
  • Computer Vision

Technology Areas

  • AI & ML
  • Autonomy
  • Space