Integrating AOG and DNN for Human Activity by Learning from Small Examples

Abstract

Objective. We propose to develop a unified framework that integrates hierarchical, compositional And-Or Graph (AOG) models with Deep Neural Networks (DNNs) for understanding human activities in videos. The new representation will enable learning complex knowledge about human activities (3D poses, attributes, actions, interactions with contextual objects, causal effects of actions, etc.) from small examples through interactive Query-Answer (QA). The unified framework will achieve the following three properties: (i) Combining the strengths of DNNs and AOGs. An AOG is a hierarchical, compositional graph whose interpretable nodes explicitly represent the spatial decomposition of the human body into parts and their kinematic dependencies, the temporal decomposition of an activity into actions and interactions with objects, and the causal relations between actions and changes in object status (i.e., fluents). We further augment the AOG by associating attributes with nodes at all levels of the hierarchy: appearance attributes for dressing styles and geometric attributes for concurrent actions. DNNs are known to have rich features, though mostly implicit and not directly interpretable, which lead to improved performance on complex data through end-to-end training. (ii) Learning from small examples through weakly supervised learning and QA. We will disentangle the nodes in the DNN by imposing regularization terms that make the nodes more interpretable, and thus create tight and sparse connections between the DNN nodes and the AOG nodes. The improved interpretability will enable semantically meaningful communication between the nodes in the representation and human users, and thus we can grow the AOG to represent complex knowledge. (iii) Effective inference and information fusion.

Technical Approach. The proposed work is divided into three tasks: (i) Task 1: Integrating AOG and DNN with bottom-up and top-down inference.
Given an image or short video, the output for activity understanding is a parse graph whose nodes are derived from the AOG and grounded on DNN features. The score (or log-probability) of each node in the parse graph is contributed by the α-, β-, and γ-processes, and the weights of these processes vary due to occlusion, low resolution, and hidden fluents. It is crucial to assign credit and penalty correctly among these pathways for their successes and failures during the training stage. (ii) Task 2: Learning human-object interaction and object fluent changes. In most daily activities, excluding some sports and dance, human actions are goal-guided, aiming to change certain fluents of objects in the scene. Therefore we will model human-object interactions, track the objects under manipulation, and detect the changing fluents. More specifically, we will train a generative AOG + DNN model to represent changes in an object's appearance (e.g., pushing a button to turn it on), geometry (e.g., blowing up a balloon), and topology (e.g., cutting an apple). (iii) Task 3: Tightly coupling AOG and DNN and learning through weakly supervised QA. In contrast to simply channeling DNN features to AOG nodes, we impose new loss functions to disentangle DNN nodes and make them more interpretable and tightly coupled with AOG nodes at multiple levels of granularity. Then we will grow the AOG to add new concepts encountered in video by QA and weakly supervised learning. The QA will utilize contextual information, such as "what is the left hand touching?"

Outcome and Impact. The proposed work has immediate impact on a range of DoD missions, including persistent ISR in video surveillance and information gathering, fusion, and retrieval.
To support these applications, we target a range of domain tasks: (i) joint 2D/3D human pose, part, and attribute recognition on large-scale datasets; (ii) joint inference of actions and object fluent changes; and (iii) inferring the underlying tasks and goals of the humans in video and predicting their next actions.
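
As a rough illustration of Task 1, the sketch below scores a toy And-Or graph and returns a parse graph. It is not the project's model. The abstract does not define the three processes that contribute to a node's score, so the sketch assumes one common reading: α is a node's own direct detection score, β is bottom-up evidence from its children, and γ is top-down context from its parent. The linear weighting, the node names, and the numeric scores (stand-ins for DNN outputs) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed default weights (w_alpha, w_beta, w_gamma); purely illustrative.
DEFAULT_WEIGHTS = (1.0, 1.0, 0.5)


@dataclass
class Node:
    name: str
    kind: str  # "and" (composition), "or" (alternatives), or "leaf"
    children: List["Node"] = field(default_factory=list)


def parse(
    node: Node,
    alpha: Dict[str, float],
    weights: Dict[str, Tuple[float, float, float]],
    parent_alpha: float = 0.0,
) -> Tuple[float, List[str]]:
    """Return (score, parse graph as a list of selected node names)."""
    w_a, w_b, w_g = weights.get(node.name, DEFAULT_WEIGHTS)
    a = alpha.get(node.name, 0.0)  # alpha: direct detection score (assumed)

    # Score every child, passing this node's detection score down as context.
    results = [parse(c, alpha, weights, a) for c in node.children]

    if node.kind == "and":
        # And-node: all children are part of the parse; evidence adds up.
        beta = sum(score for score, _ in results)
        selected = [name for _, names in results for name in names]
    elif node.kind == "or":
        # Or-node: keep only the best-scoring alternative.
        beta, selected = max(results, key=lambda r: r[0])
    else:
        beta, selected = 0.0, []

    gamma = parent_alpha  # gamma: top-down context from the parent (assumed)
    score = w_a * a + w_b * beta + w_g * gamma
    return score, [node.name] + selected


# Toy grammar: a person is a torso AND an arm; the arm is raised OR down.
arm = Node("arm", "or", [Node("arm_raised", "leaf"), Node("arm_down", "leaf")])
person = Node("person", "and", [Node("torso", "leaf"), arm])

# Hypothetical detector scores standing in for DNN outputs.
alpha = {"person": 0.5, "torso": 1.0, "arm": 0.0,
         "arm_raised": 2.0, "arm_down": 0.5}

score, graph = parse(person, alpha, weights={})
print(score, graph)

# If the torso is occluded, its direct detector can be switched off and the
# top-down context weighted up instead, by overriding that node's weights.
occluded_score, _ = parse(person, alpha, weights={"torso": (0.0, 1.0, 1.0)})
print(occluded_score)
```

Per-node weights are kept in a separate dictionary so that the same graph can be re-scored under different conditions, which mirrors the abstract's remark that the weights vary with occlusion and resolution. How those weights would be learned is outside this sketch.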

Document Details

Document Type
DoD Grant Award
Publication Date
Feb 14, 2019
Source ID
W911NF1810296

Entities

People

  • Song-Chun Zhu

Organizations

  • Army Contracting Command
  • United States Army
  • University of California, Los Angeles

Tags

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks