Learning and Parsing Video Events with Goal and Intent Prediction

Abstract

In this paper, we present a framework for parsing video events with stochastic Temporal And-Or Graph (T-AOG) and unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions which are specified by a number of grounded relations over image frames. An And-node represents a sequence of actions. An Or-node represents a number of alternative ways of such concatenations. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm to learn the atomic actions, the temporal relations and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG which can understand events, infer the goal of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. ii) We learn the grammar, including atomic actions and temporal relations, automatically from the video data without manual supervision. iii) Our algorithm infers the goal of agents and predicts their intents by a top-down process, handles events insertion and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Mar 19, 2012
Accession Number: ADA558921

Entities

People

Benjamin Yao
Mingtao Pei
Song-Chun Zhu
Zhangzhang Si

Organizations

University of California, Los Angeles

Learning and Parsing Video Events with Goal and Intent Prediction

Abstract

Document Details

Entities

People

Organizations

Tags

Communities of Interest

DTIC Thesaurus Topics

Fields of Study

Readers

Technology Areas