Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

Abstract

Temporal reasoning is an important aspect of video analysis. 3D CNNs perform well by exploring spatial-temporal features jointly in an unconstrained way, but they also increase the computational cost substantially. Previous works try to reduce the complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition method that splits the feature channels into spatial and temporal groups that are processed in parallel. This decomposition lets the two groups focus on static and dynamic cues separately. We call this grouped spatial-temporal aggregation (GST). The decomposition is more parameter-efficient and enables us to quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.
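
The parameter-efficiency claim can be illustrated with a rough weight count. The sketch below is not taken from the report: it assumes that both input and output channels are split into a spatial group handled by a 1x3x3 kernel and a temporal group handled by a 3x3x3 kernel, that the two groups do not share weights, and that biases are ignored. The function names and the `temporal_fraction` parameter are hypothetical.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weight count of a dense 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def grouped_st_params(c_in, c_out, temporal_fraction=0.5, k=3):
    """Weight count when channels are split into two parallel groups.

    Assumption for illustration: the temporal group uses a k x k x k kernel,
    the spatial group uses a 1 x k x k kernel, and each group only sees its
    own slice of the input and output channels.
    """
    c_in_t = int(c_in * temporal_fraction)
    c_out_t = int(c_out * temporal_fraction)
    c_in_s = c_in - c_in_t
    c_out_s = c_out - c_out_t
    spatial = c_in_s * c_out_s * 1 * k * k
    temporal = c_in_t * c_out_t * k * k * k
    return spatial + temporal


full = conv3d_params(64, 64)
grouped = grouped_st_params(64, 64, temporal_fraction=0.5)
print(full, grouped, full / grouped)
```

Under these assumptions, a 64-to-64-channel layer needs 110592 weights as a dense 3D convolution and 36864 weights when split evenly, a threefold reduction. The actual split ratio and layer layout used in the report may differ.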

Document Details

Document Type
Technical Report
Publication Date
Oct 27, 2019
Accession Number
AD1153032

Entities

People

  • Alan Yuille
  • Chenxu Luo

Organizations

  • Johns Hopkins University

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automata Theory
  • Channel Capacity
  • Computer Science
  • Computer Vision
  • Computers
  • Computing System Architectures
  • Convolutional Neural Networks
  • Decomposition
  • Image Recognition
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Neural Networks
  • Pattern Recognition
  • Recognition
  • Video Clips

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Neural Network Machine Learning