CHARM: Compositional and Hierarchical Action Reasoning and Modeling

Abstract

Siemens Corporation, Corporate Technology proposes to build a retrieval system that, given a user query in natural language, searches for videos that match the activity specified in the query. Our approach targets the Automated Image Understanding Thrust of the Computational Methods for Decision Making program of the Office of Naval Research (ONR). Central to our proposal is the design of an intelligent visual dialog agent for interaction between the operator and the activity recognition system. This is motivated by the need for an intuitive, interactive, dialog-enabled agent for end users of such a system, who are not necessarily experts in computer vision or related areas. Besides providing an easy-to-use natural language interface for the operator, the proposed system facilitates unrestricted definition of target activities, i.e., the user can search for unseen activities of interest via sentential queries. While this unrestricted nature may result in false alarms or ambiguous retrieval results, we address this issue by means of a "clarification dialog" between the operator and the algorithm, iteratively refining the retrieval results until the activity of interest is detected. A key motivating factor for our design is the ability of the user to specify previously unseen activities via natural language. Current state-of-the-art activity recognition algorithms are typically restricted to visually clear primitive actions due to challenges arising from the wide variety of human actions seen in videos. Furthermore, these primitive actions either lack interactions among humans and objects or are constrained to controlled scenarios. These restrictive assumptions are clearly a performance bottleneck for surveillance applications, where activities of interest are typically atypical!
To address this critical challenge, we propose a novel activity recognition algorithm, Compositional and Hierarchical Action Reasoning and Modeling (CHARM), to drive the underlying analytics of our retrieval system. CHARM exploits the compositional and hierarchical nature of natural language to detect unseen activities of interest, thereby eliminating the need to retrain our models for new activities. A key novelty of CHARM is its ability to jointly solve the three key tasks involved in activity recognition, viz. multi-view tracking, primitive activity recognition, and object interaction recognition. This enables CHARM to ask the operator meaningful questions as part of the clarification dialog process, focusing the search on the video that best represents the activity of interest.

Document Details

Document Type
DoD Grant Award
Publication Date
Sep 19, 2018
Source ID
N000141812753

Entities

People

  • Jan Ernst

Organizations

  • Office of Naval Research
  • Siemens
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computational Linguistics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML