Visual-textual Capsule Routing for Text-based Video Segmentation

Abstract

Joint understanding of vision and natural language is a challenging problem with a wide range of applications in artificial intelligence. In this work, we focus on the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and the textual input as capsules, which provide a more effective representation than standard convolution-based features. Our novel visual-textual routing mechanism fuses the video and text capsules to localize the actor and action. Existing work on actor-action localization mainly addresses localization in a single frame rather than the full video. In contrast, we propose to perform the localization on all frames of the video. To validate the potential of the proposed network for actor and action video localization, we extend an existing actor-action dataset (A2D) with annotations for all frames. The experimental evaluation demonstrates the effectiveness of our capsule network for text-selective actor and action localization in videos. The proposed method also improves upon the performance of existing state-of-the-art methods on single-frame localization.

Document Details

Document Type: Technical Report
Publication Date: Jun 14, 2020
Accession Number: AD1152535

Entities

People

Bruce Mcintosh
Kevin Duarte
Mubarak Ali Shah
Yogesh S Rawat

Organizations

University of Central Florida
