Video Instance Segmentation Tracking with a Modified VAE Architecture

Abstract

We propose a modified variational autoencoder (VAE) architecture built on top of Mask R-CNN for instance-level video segmentation and tracking. The method uses a shared encoder and three parallel decoders, yielding three disjoint branches that predict future frames, object detection boxes, and instance segmentation masks. To solve these multiple learning tasks effectively, we introduce a Gaussian Process model that enhances the statistical representation of the VAE by relaxing the strong independent and identically distributed (iid) prior assumption of conventional VAEs and allowing potential correlations among the extracted latent variables. The network learns the spatial interdependence and motion continuity embedded in video data and creates a representation that is effective for producing high-quality segmentation masks and tracking multiple instances in diverse and unstructured videos. Evaluation on a variety of recently introduced datasets shows that our model outperforms previous methods and achieves new best-in-class performance.
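
The idea of replacing the iid prior with a Gaussian Process prior can be illustrated with a small sketch. This is not the report's implementation. It assumes, purely for illustration, a single latent dimension observed over T frames, a diagonal Gaussian posterior, and a squared-exponential (RBF) kernel over frame indices. The function names, the kernel choice, and the length scale are all hypothetical. The sketch computes the KL term of the VAE objective against a correlated prior N(0, K) instead of the usual N(0, I).

```python
import math

def rbf_kernel(times, length_scale=2.0, jitter=1e-6):
    """Assumed prior covariance: K[i][j] = exp(-(t_i - t_j)^2 / (2 l^2)), jitter on the diagonal."""
    n = len(times)
    return [[math.exp(-((times[i] - times[j]) ** 2) / (2.0 * length_scale ** 2))
             + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_lower_transposed(L, b):
    """Back substitution: solve L^T x = b."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def kl_to_gaussian_prior(mu, var, K):
    """KL( N(mu, diag(var)) || N(0, K) ) for one latent dimension across T frames."""
    n = len(mu)
    L = cholesky(K)
    # tr(K^-1 diag(var)) = sum_i var_i * (K^-1)_ii; column i of K^-1 solves K x = e_i.
    trace = 0.0
    for i in range(n):
        e = [1.0 if k == i else 0.0 for k in range(n)]
        col = solve_lower_transposed(L, solve_lower(L, e))
        trace += var[i] * col[i]
    y = solve_lower(L, mu)
    maha = sum(v * v for v in y)  # mu^T K^-1 mu
    logdet_K = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    logdet_S = sum(math.log(v) for v in var)
    return 0.5 * (trace + maha - n + logdet_K - logdet_S)

# Hypothetical posterior means for 4 consecutive frames, same variances.
K = rbf_kernel([0, 1, 2, 3])
kl_smooth = kl_to_gaussian_prior([1.0, 1.0, 1.0, 1.0], [0.5] * 4, K)
kl_jagged = kl_to_gaussian_prior([1.0, -1.0, 1.0, -1.0], [0.5] * 4, K)
print(kl_smooth < kl_jagged)
```

When K is the identity matrix, this expression reduces to the standard VAE KL term. With a correlated K, latent trajectories that vary smoothly across frames are penalized less than erratic ones, which is one plausible way a correlated prior could encourage the motion continuity the abstract mentions.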

Document Details

Document Type
Technical Report
Publication Date
Jun 14, 2020
Accession Number
AD1152532

Entities

People

  • Chung-ching Lin
  • Linglin He
  • Rogerio Feris
  • Ying Hung

Organizations

  • International Business Machines Corporation (Armonk, NY)
  • Rutgers University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computer Vision
  • Computers
  • Computing System Architectures
  • Decoding
  • Detection
  • Gaussian Processes
  • Image Processing
  • Image Recognition
  • Information Processing
  • Information Science
  • Information Systems
  • Multitarget Tracking
  • Neural Networks
  • Pattern Recognition
  • Recognition

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning