A Structured Model for Action Detection

Abstract

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem: the models that need to be trained are large, and the labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model to simplify optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long-term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5 mAP and over the state of the art by 4.8 mAP.
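
To make the graph-based reasoning step more concrete, the following is a minimal sketch of one generic graph-convolution update over actor and object nodes. It is not taken from the report: the function names, the toy graph, the feature sizes, and the row-normalized adjacency with self-loops are all assumptions chosen for illustration, and the report's actual network may differ in every one of these respects.

```python
# Hypothetical sketch: one graph-convolution step over actor/object nodes.
# All names, shapes, and the normalization scheme are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalize(adj):
    """Add self-loops, then scale each row so that it sums to 1."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def graph_conv(features, adj, weight):
    """Compute ReLU(A_hat @ features @ weight) for one layer."""
    mixed = matmul(row_normalize(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(mixed, weight)]

# Toy graph: one actor (node 0) connected to two objects (nodes 1 and 2).
adj = [[0.0, 1.0, 1.0],
       [1.0, 0.0, 0.0],
       [1.0, 0.0, 0.0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [0.0, 3.0]]
# Identity weights, so the output shows the neighbor averaging directly.
weight = [[1.0, 0.0],
          [0.0, 1.0]]

updated = graph_conv(features, adj, weight)
```

In this toy setup each node's updated feature is the average of its own feature and those of its neighbors, so the actor node ends up with a representation that mixes in information from both objects. In a trained model the weight matrix would be learned rather than fixed to the identity.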

Document Details

Document Type
Technical Report
Publication Date
Jun 16, 2019
Accession Number
AD1152105

Entities

People

  • Cordelia Schmid
  • Martial Hebert
  • Pavel Tokmakov
  • Yubo Zhang

Organizations

  • Carnegie Mellon University
  • Google

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Computer Vision
  • Computers
  • Computing System Architectures
  • Convolutional Neural Networks
  • Deep Learning
  • Detection
  • Detectors
  • Dimensionality Reduction
  • Feature Extraction
  • Image Recognition
  • Information Science
  • Machine Learning
  • Neural Networks
  • Recognition
  • Supervised Machine Learning
  • Video
  • Video Clips

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks