Learning and Parsing Video Events with Goal and Intent Prediction

Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for unsupervised learning of the T-AOG from video. The T-AOG represents a stochastic event grammar. Its alphabet consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, each specified by a number of grounded relations over image frames. An And-node represents a sequence of actions. An Or-node represents a number of alternative ways of composing such a sequence. Together, the And-Or nodes generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as a stochastic context-free grammar (SCFG). For each And-node, we additionally model the temporal relations among its child nodes, both to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm that learns the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG that can understand events, infer the goals of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions: (i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from the video data without manual supervision. (iii) Our algorithm infers the goals of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, and keeps all possible interpretations of the video to preserve ambiguities.
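The And-Or structure described in the abstract can be illustrated with a small sketch. This is not the authors' learning or parsing algorithm: it is a toy enumeration that assumes a hand-written grammar, hypothetical action names (`enter_room`, `fill_cup`, and so on), and made-up branch probabilities. It also omits the temporal relations on And-nodes, so it covers only the SCFG-equivalent part of a T-AOG. It shows how And-nodes concatenate children, how Or-nodes choose among alternatives, and how an observed prefix of atomic actions can be used to rank plausible next actions.

```python
from dataclasses import dataclass

# Hypothetical node types for a toy And-Or graph (names are illustrative).
@dataclass(frozen=True)
class Terminal:
    name: str              # an atomic action label

@dataclass(frozen=True)
class And:
    children: tuple        # child nodes that occur in sequence

@dataclass(frozen=True)
class Or:
    branches: tuple        # (probability, child node) alternatives


def expand(node):
    """Enumerate every action sequence the node can generate, with its probability."""
    if isinstance(node, Terminal):
        return [((node.name,), 1.0)]
    if isinstance(node, And):
        results = [((), 1.0)]
        for child in node.children:
            # Concatenate each partial sequence with each expansion of the child.
            results = [(seq + cseq, p * cp)
                       for seq, p in results
                       for cseq, cp in expand(child)]
        return results
    if isinstance(node, Or):
        # Each alternative contributes its expansions, weighted by branch probability.
        return [(seq, bp * p)
                for bp, child in node.branches
                for seq, p in expand(child)]
    raise TypeError(f"unknown node type: {type(node).__name__}")


def predict_next(root, observed):
    """Normalized distribution over the next atomic action, given an observed prefix."""
    observed = tuple(observed)
    scores = {}
    for seq, p in expand(root):
        if len(seq) > len(observed) and seq[:len(observed)] == observed:
            nxt = seq[len(observed)]
            scores[nxt] = scores.get(nxt, 0.0) + p
    total = sum(scores.values())
    return {action: s / total for action, s in scores.items()} if total else {}


# Toy event "get a drink": two alternative sub-events, then a shared final action.
# The structure and the probabilities 0.75 / 0.25 are assumptions for illustration.
get_drink = And((
    Terminal("enter_room"),
    Or((
        (0.75, And((Terminal("walk_to_dispenser"), Terminal("fill_cup")))),
        (0.25, And((Terminal("walk_to_vending"), Terminal("take_can")))),
    )),
    Terminal("drink"),
))

if __name__ == "__main__":
    for seq, p in expand(get_drink):
        print(p, " -> ".join(seq))
    print(predict_next(get_drink, ["enter_room"]))
```

Exhaustive enumeration is only practical for tiny grammars like this one; it is used here because it keeps every interpretation of the prefix explicit, which mirrors the abstract's point about preserving ambiguities.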

Document Details

Document Type
Technical Report
Publication Date
Mar 19, 2012
Accession Number
ADA558921

Entities

People

  • Benjamin Yao
  • Mingtao Pei
  • Song-Chun Zhu
  • Zhangzhang Si

Organizations

  • University of California, Los Angeles

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Alphabets
  • Ambiguity
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computer Vision
  • Context Free Grammars
  • Data Mining
  • Detection
  • Grammars
  • Language
  • Learning
  • Machine Learning
  • Models
  • Probability
  • Recognition
  • Unsupervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Criminal Law
  • Neural Network Machine Learning

Technology Areas

  • AI & ML