Semantic Entity Detection and Localization in Videos using Natural Language Queries

Abstract

Detection and localization of semantic entities, such as objects and actions, in videos is important for many tasks of interest to the U.S. Navy and of fundamental scientific interest. Traditional methods for this purpose have used detectors trained on pre-defined categories; these require large training data and are limited to the defined categories. An enhancement is to “query by example”, where the categories are not pre-defined but indicated by given examples; while this expands the class of entities that can be detected, providing multiple training examples puts a big burden on the user and is not feasible in general. We propose to develop techniques for semantic detection and localization where the entities of interest are defined in natural language. Language expresses semantic entities directly and provides a comfortable interface for the user; it also allows for “query expansion” so that the system is not limited to pre-defined classes but can find related ones as well. The query can be complex to include attributes and spatio-temporal relations in addition to the atomic noun/verb phrases. In recent years, great progress has been achieved in detection and recognition of objects largely powered by the neural-network, deep learning paradigm. A common characteristic of these methods is that they are based on training for pre-defined categories and require large amounts of annotated training data. These methods can be generalized for new categories if sufficient number of examples is given. The methods require ability to quickly learn new classifiers, perhaps using learned features from existing categories. However, it is not practical for a user to collect such examples when looking for objects and actions not in the original dictionary. We propose to use natural language for specifying activities of interest. Representation of semantics in two modalities is inherently different and the both also have critical ambiguities, making their synthesis highly challenging. The approaches that have been most successful embed both language and visual entities in feature spaces and learn a sub-space that relates the two. While there has been some work in detecting and localizing objects using language, there has been relatively little work on detection and localization of activities using natural language and will be a key focus of our proposed work. We propose to exploit not only the atomic phrases that apply to a single entity but also include their attributes and complex relations; the queries can thus be complex phrases or sentences and not a single noun phrase or verb phrase. The need for finely annotated training data is a major hurdle in use of deep network paradigms. We propose to overcome this difficulty by use of a layered approach where the lower layers can be trained from datasets that are not domain specific. Intermediate layers will require data where images and videos are paired but we expect to be able to work with weak supervision. At the highest levels, as stated above, we intend to use more transparent networks which have few parameters and can be learned with small amounts of dataantic

Document Details

Document Type
DoD Grant Award
Publication Date
Dec 20, 2017
Source ID
N000141812050

Entities

People

  • Ramakant Nevatia

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Southern California

Tags

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning.
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks
  • Space