Foundations of Image and Multimodal Data Analysis (Young Investigator Program): Spatial Memory Networks: A Machine Perception Framework for Detecting Familiar and Unfamiliar Objects in Video

Abstract

Visual object detection (automatically classifying and localizing an object in visual data) is a fundamental problem in computer vision and has broad applicability in numerous fields, including AI, defense, medicine, and agriculture. While there is a long history of research on detecting objects in static images, there has been relatively little research on detecting objects in videos. (Most works focus instead on tracking, in which the object is manually annotated in an initial frame.) However, cameras on robots, unmanned vehicles, wearable devices, etc., receive videos and not static images. Thus, for these systems to recognize the key objects and their interactions, it is critical that they be equipped with accurate video object detectors.

Existing approaches that detect objects in video simply run a static image-based detector independently on each frame. However, due to the unique challenges of video (e.g., motion blur, low resolution, compression artifacts), a static image detector does not generalize well to video. Furthermore, videos provide rich temporal and motion information that the detector should exploit during both training and testing for improved performance.

In this project, we propose a novel machine perception framework for detecting objects in video. The key contribution is a deep recurrent spatial memory network that models the long-term temporal dependencies of an object's appearance and motion at the pixel level. By modeling at the pixel level, our network will provide an interpretable and expressive model of video content. To accomplish these goals, our research objectives are:

- Thrust I: Design the spatial memory network to model an object's appearance and motion at the pixel level over time, and train it to detect objects in videos. Study the limitations of running a static-image detector on video, and explore how modeling long-term temporal dependencies beyond simple short-term motion cues can help detection.
- Thrust II: Investigate how to expand our network architecture to detect unfamiliar objects, for which we have no training data. Study the algorithmic changes required, and how to bias the model to detect moving objects rather than static ones.
- Thrust III: Explore the various ways in which our models trained for detecting objects in video can serve as a pre-processing step for downstream video applications such as video summarization, search, and annotation propagation.

We expect to deliver basic research advancements through scientific publications, mathematical and learning models, and software prototypes. The proposed project entails novel technical contributions in computer vision and machine learning, and will form an integral step towards the overarching goal of object detection in video. Whereas mainstream object detection research focuses on static images, this proposal calls for a distinct emphasis on videos. Our network will be designed to model long-term temporal dependencies and motion cues, enabling the detector to learn to use the rich temporal and motion information provided by video. The developed video object detection models and software will be shared with the research community to facilitate video analysis research and applications.
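
The abstract does not specify how the spatial memory is updated, so the following is only an illustrative sketch of what a pixel-level recurrent memory could look like, not the project's actual network. It assumes a gated (GRU-style) update on a single-channel feature map, with a 3x3 neighborhood mean standing in for a learned convolution. The function names (`update_memory`, `run_memory`) and the fixed scalar weights are hypothetical; in a real model the weights would be learned.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def neighborhood_mean(grid, r, c):
    """Mean of the 3x3 window around (r, c), clipped at the borders."""
    h, w = len(grid), len(grid[0])
    vals = [grid[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]
    return sum(vals) / len(vals)


def update_memory(memory, frame,
                  w_gate=(4.0, 0.0, -2.0),   # hypothetical: (frame, context, bias)
                  w_cand=(2.0, 1.0, 0.0)):   # hypothetical: (frame, context, bias)
    """One gated recurrent step applied independently at every pixel.

    memory and frame are H x W lists of floats. The gate z decides how much
    of the old memory is overwritten by a candidate value computed from the
    current frame and the spatial context of the previous memory.
    """
    h, w = len(frame), len(frame[0])
    new_memory = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            x = frame[r][c]
            ctx = neighborhood_mean(memory, r, c)
            z = sigmoid(w_gate[0] * x + w_gate[1] * ctx + w_gate[2])
            cand = math.tanh(w_cand[0] * x + w_cand[1] * ctx + w_cand[2])
            new_memory[r][c] = (1.0 - z) * memory[r][c] + z * cand
    return new_memory


def run_memory(frames):
    """Fold a sequence of per-frame feature maps into one spatial memory."""
    h, w = len(frames[0]), len(frames[0][0])
    memory = [[0.0] * w for _ in range(h)]
    for frame in frames:
        memory = update_memory(memory, frame)
    return memory


def make_frame(value_at_center):
    """3x3 toy feature map with a single response at the center pixel."""
    frame = [[0.0] * 3 for _ in range(3)]
    frame[1][1] = value_at_center
    return frame


# An object responds at the center for three frames, then the response drops
# out in the fourth frame (as might happen under motion blur).
frames = [make_frame(1.0), make_frame(1.0), make_frame(1.0), make_frame(0.0)]
memory = run_memory(frames)
```

In this toy sequence, a detector run independently on the last frame sees no response at the center pixel, while the memory still carries evidence from the earlier frames at that location. This is the kind of temporal carry-over the abstract argues a per-frame static detector lacks.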

Document Details

Document Type
DoD Grant Award
Publication Date
Oct 16, 2018
Source ID
W911NF1710410

Entities

People

  • Yong Jae Lee

Organizations

  • Army Contracting Command
  • United States Army
  • University of California, Davis

Tags

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning.
  • Sensor Fusion and Tracking Systems.
  • Vision Science/Vision Psychology/Cognitive Neuroscience.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Autonomy