Visual-textual Capsule Routing for Text-based Video Segmentation

Abstract

Joint understanding of vision and natural language is a challenging problem with a wide range of applications in artificial intelligence. In this work, we focus on the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and the textual input as capsules, which provide a more effective representation than standard convolution-based features. Our novel visual-textual routing mechanism fuses the video and text capsules to localize the actor and action. Existing work on actor-action localization focuses mainly on localization in a single frame rather than the full video. In contrast, we propose to perform the localization on all frames of the video. To validate the potential of the proposed network for actor and action video localization, we extend an existing actor-action dataset (A2D) with annotations for all frames. The experimental evaluation demonstrates the effectiveness of our capsule network for text-selective actor and action localization in videos. The proposed method also improves upon the performance of existing state-of-the-art methods on single-frame localization.

Document Details

Document Type
Technical Report
Publication Date
Jun 14, 2020
Accession Number
AD1152535

Entities

People

  • Bruce McIntosh
  • Kevin Duarte
  • Mubarak Ali Shah
  • Yogesh S Rawat

Organizations

  • University of Central Florida

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computer Languages
  • Computer Vision
  • Computers
  • Detection
  • Information Processing
  • Information Systems
  • Intelligence Community (United States)
  • Language
  • Natural Language Processing
  • Natural Languages
  • Network Architecture
  • Neural Networks
  • Pattern Recognition
  • Recognition
  • Video Frames

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computational Linguistics
  • Image Processing and Computer Vision

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval