Supervised, Unsupervised and Out-of-Distribution Activity Recognition and Segmentation Using Transformer Networks

Abstract

Approved for Public Release

This proposal is for the development of representations and algorithms for understanding human activities in videos. An activity may be simple, e.g., consisting of a simple movement, or a hierarchy of increasingly complex movements, ranging from localized, atomic body-part movements to longer-range, semantic movements involving subsets of body parts. Since activities are also often referred to as actions, the two terms are used interchangeably in this proposal. The proposed work addresses intersecting parts of three interrelated objectives: (i) supervised recognition and segmentation of activities; (ii) unsupervised deep metric learning; and (iii) recognition and segmentation of Out-of-Distribution (OOD) samples, which are not included in the In-Distribution (ID) data from training classes. Parts of the proposed research extend or are motivated by the PI's recent work under ONR support.

Proposed work in (i) on segmentation, also called generating temporal action proposals in the literature, is aimed at identifying temporal intervals containing human actions in untrimmed videos. For arbitrary actions, this requires learning long-range interactions. Results of human movement studies, in particular Laban Movement Analysis, are used as computational definitions of movements to form hierarchical representations of actions. Action proposals are learned using transformers with different heads to detect action boundary instants and action time intervals. The transformers act on inputs of image features, optical flow fields, 3D pose estimates, and the Laban-movement-based low-to-high-level movement descriptors. The goal is an end-to-end movement-to-action Transformer network.

Proposed objective (ii) extends the work on (i) to learn a semantic space using deep metric learning (DML). Most current DML approaches rely on optimizing loss functions that tend to push the representations of all samples of a class close to each other, which prevents them from learning fine-grained intra-class features. Learning these fine-grained features is crucial for generalizable DML. The proposed approach uses a new potential-field-based formulation that is sensitive to variations within classes and helps learn a representation space reflective of these fine-grained variations. It uses attractive and repulsive force components whose magnitudes decrease with increasing distance. The attractive component helps pull together similar sample representations within a class. The repulsive component pushes apart samples from different classes that lie close to each other. Most of the force therefore arises from local similarities and dissimilarities. The proposed potential field model avoids complicated sample mining strategies by using class proxies to model the underlying class distributions. This may help better optimize inter-sample interactions and enable stable and faster convergence.

Proposed objective (iii), OOD segmentation, poses a significant challenge to both the supervised and unsupervised cases in objectives (i) and (ii), as it requires learning about classes not seen during training. Existing methods address this problem by contrasting the behaviors of class models on ID and OOD samples at the pixel level. These methods fail to exploit the spatial structure of images, which leads to spatially noisy results. The proposed work is towards class-agnostic, structure-constrained learning of segmentation and consists of two parts: class-agnostic region proposal generation and structure-constrained rectification. The former is proposed to extract region prototypes that capture the visual appearance and structure of the training data and to generate probabilistic region proposals for unseen input images. The rectification part is proposed to use structural information to improve the per-pixel segmentation predictions of existing algorithms. To evaluate its generality, the developed method is also proposed to be tested on the tasks of domain adaptation and zero-shot semantic segmentation.
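
The potential-field idea described under objective (ii) can be illustrated with a small sketch. This is not the formulation from the proposal, which the abstract does not specify. It assumes a hypothetical potential 1 / (delta + d) over the Euclidean distance d between a sample embedding and a class proxy, so that the force magnitude 1 / (delta + d)^2 decreases with increasing distance. The function names, the `delta` parameter, and the toy proxies are all illustrative assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def potential(d, delta=1.0):
    """Assumed decaying potential 1 / (delta + d) (illustrative choice)."""
    return 1.0 / (delta + d)


def force_magnitude(d, delta=1.0):
    """Magnitude of the slope of the potential: 1 / (delta + d)**2.

    It shrinks as d grows, so nearby proxies dominate the interaction.
    """
    return 1.0 / (delta + d) ** 2


def potential_field_loss(embeddings, labels, proxies, delta=1.0):
    """Mean potential energy of samples in the field of class proxies.

    The same-class proxy contributes -potential (attraction: the energy
    drops as the sample approaches it). Proxies of other classes
    contribute +potential (repulsion: the energy drops as the sample
    moves away). No sample mining is involved; only proxies are used.
    """
    total = 0.0
    for z, y in zip(embeddings, labels):
        for cls, proxy in proxies.items():
            u = potential(distance(z, proxy), delta)
            total += -u if cls == y else u
    return total / len(embeddings)


# Toy example: two hypothetical class proxies on a line.
proxies = {0: [0.0, 0.0], 1: [4.0, 0.0]}

# A class-0 sample near its own proxy has lower energy than one that
# sits near the proxy of class 1.
near = potential_field_loss([[0.5, 0.0]], [0], proxies)
far = potential_field_loss([[3.5, 0.0]], [0], proxies)
print(near < far)
```

In a real training setup the embeddings and proxies would be learnable tensors and this energy would be minimized by gradient descent. The bounded, decaying potential used here is only one of many possible choices.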

Document Details

Document Type
DoD Grant Award
Publication Date
Mar 08, 2024
Source ID
N000142412169

Entities

People

  • Narendra Ahuja

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Illinois Urbana–Champaign

Tags

Fields of Study

  • Computer science

Readers

  • Computer Vision.
  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Space