Multi-Modal Video Summarization with User Input (5.2.1)

Abstract

Video data is growing explosively due to ubiquitous acquisition capabilities. While this "big video data" is a great source for information discovery and extraction, the computational challenges it poses are unparalleled. Intelligent algorithms for automatic video summarization, comprehension, etc. have thus re-emerged as a pressing need. Progress on this subject will significantly increase the military's mission capability to act quickly and decisively on the information contained in videos, which would otherwise not be possible. In this proposal, we propose to study multi-modal video summarization with user input. The summarizer to be developed is capable of taking users' text queries into account and thus generating personalized summaries of the videos. The summaries could take the form of a shortened version of the original lengthy video, a short text paragraph describing the main gist of the video, or both. An exemplar use case of this new video summarization scheme is that one military site may be of more interest for some purpose than others. Our algorithm accepts such user input and summarizes different types of sites, people, and actions at user-tailored granularities. In contrast, prior video summarizers often transcend the actual content of the videos and are not responsive to any user intervention. We propose three main thrusts in this proposal, making an integrated effort to tackle user-focused video summarization. First, we propose a novel probabilistic framework as the overarching model for this project. This framework builds upon our prior work on determinantal point processes (DPPs), which are quite versatile in promoting diversity in the video summary, and, leveraging recent progress in the deep learning literature, parametrizes the DPP kernels through a memory network.
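To make the DPP idea above concrete, the following is a minimal sketch of greedy subset selection under a DPP kernel of the common "quality times similarity" form. It is an illustration under stated assumptions, not the proposal's model: the function names (`build_kernel`, `greedy_dpp`), the cosine similarity, and the hard-coded relevance scores are all hypothetical. In the proposal the kernel would be parametrized by a memory network conditioned on the user query; here a fixed per-shot relevance score stands in for that component.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # treat numerically singular matrices as singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def build_kernel(features, relevance):
    """L[i][j] = q_i * sim(i, j) * q_j.

    `relevance` (q) is a hypothetical stand-in for query-conditioned scores.
    """
    n = len(features)
    return [[relevance[i] * cosine(features[i], features[j]) * relevance[j]
             for j in range(n)] for i in range(n)]

def greedy_dpp(kernel, k):
    """Greedily add the shot that maximizes det(L_Y); stop at k shots
    or when no remaining shot yields a positive determinant."""
    selected = []
    while len(selected) < k:
        best_item, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:
            break
        selected.append(best_item)
    return selected

# Toy data (made up): shot 1 duplicates shot 0; shot 2 is different but less relevant.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
relevance = [0.9, 0.8, 0.5]
summary = greedy_dpp(build_kernel(features, relevance), k=2)
print(summary)
```

In this toy input, shot 1 is a duplicate of shot 0, so adding it drives the determinant to zero and the greedy step picks the less relevant but distinct shot 2 instead. This is the diversity-promoting behavior the abstract attributes to DPPs. Greedy selection is only one approximate way to find a high-probability subset and is used here for brevity.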
The proposed approach does not rely on any costly user supervision about which video shots are supposed to be responsive to the user input. Instead, we use the memory network to implicitly attend the user query onto different frames within each video shot. Furthermore, we study unsupervised concept discovery methods to automatically explore the semantic concepts in any given dataset. Specifically, we cluster the video clips into candidate concepts using the newly developed dominant-set theory in computer vision. The automatically mined concepts are machine detectable and user understandable, providing a natural channel for users to interact with our video summarizer. Finally, we propose work on a more crucial task: generating textual summaries for long videos. This is a brand new way to explore video summarization, which will help users quickly access large amounts of visual data without actually watching it. We propose to explore two alternative threads. In the first, using state-of-the-art deep learning techniques, e.g., Long Short-Term Memory (LSTM) networks with spatial and temporal attention, we generate text or captions for each video clip and then compress the text using another deep learning module. In the second, we first summarize the video into relevant shots and then generate text for only those shots.
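The dominant-set clustering step mentioned above can be sketched as follows. This is a minimal illustration, assuming the usual formulation in which a dominant set is extracted by running replicator dynamics on a pairwise affinity matrix with a zero diagonal, and clusters are peeled off one at a time. The abstract does not state which procedure the project uses; the function names, the thresholds, and the toy affinity matrix are all assumptions made for the example.

```python
def dominant_set(affinity, iters=1000, tol=1e-8, cutoff=1e-4):
    """Run replicator dynamics x_i <- x_i * (Ax)_i / (x^T A x) from the
    uniform vector and return the indices that keep non-negligible weight."""
    n = len(affinity)
    x = [1.0 / n] * n
    for _ in range(iters):
        ax = [sum(affinity[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = sum(x[i] * ax[i] for i in range(n))
        if denom <= 0:
            return [0]  # no mutual affinity left: treat first item as a singleton
        new_x = [x[i] * ax[i] / denom for i in range(n)]
        change = sum(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            break
    return [i for i in range(n) if x[i] > cutoff]

def peel_clusters(affinity):
    """Repeatedly extract a dominant set and remove it from the pool."""
    remaining = list(range(len(affinity)))
    clusters = []
    while remaining:
        sub = [[affinity[i][j] for j in remaining] for i in remaining]
        members = dominant_set(sub) or list(range(len(remaining)))
        cluster = [remaining[m] for m in members]
        clusters.append(cluster)
        remaining = [i for i in remaining if i not in cluster]
    return clusters

# Toy affinities (made up): clips 0-2 are mutually similar, as are clips 3-4.
affinity = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.0],
]
clusters = peel_clusters(affinity)
print(clusters)
```

Each extracted cluster would correspond to one candidate concept. In practice the affinities would come from learned clip features rather than hand-written numbers, and the number of clusters falls out of the peeling procedure instead of being fixed in advance.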

Document Details

Document Type
DoD Grant Award
Publication Date
Jun 25, 2019
Source ID
W911NF1910356

Entities

People

  • Mubarak Ali Shah

Organizations

  • Army Contracting Command
  • United States Army
  • University of Central Florida

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computational Linguistics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks