High-Level Structured Activity Models
Abstract
What happened over the course of the day in this video captured with a wearable camera? How did it differ from typical events? What story is being “told” in this short YouTube video? Having seen a variety of sequences containing a certain event, what are the common atomic actions and actors that define it? Such video analysis questions have significant implications for consumer, defense, and scientific applications where users are faced with video content at an unprecedented scale. To answer such questions, a computer vision system must possess a high-level, structured representation of activity. However, existing research in activity analysis is largely focused on learning discriminative low-level cues that can predict predefined action category labels, essentially pattern matching with space-time video descriptors.
In this project, we propose to explore structured representations of activity. We will develop techniques that integrate high-level external knowledge about the narrative of events with sophisticated statistical learning algorithms. We will apply them to recognize and summarize activity in video and images. The research will proceed along three main directions:
• Visual narratives: In this component, we will investigate ways to discover and exploit the narrative “story-like” structure within video data. Detecting and exploiting narrative structure demands a model for high-level visual influence. We aim to move beyond today’s co-occurrence based metrics of object/action relationships, to instead predict the extent to which one entity leads to another. We will develop data-driven graph algorithms that can reveal connections between detected objects and atomic actions on a large scale. In addition, we will consider how external knowledge of scripts (partially ordered structured sequences of participants and events) can help inform the representations we learn.
• Representing objects and actors: The other essential component in creating structured representations for video narratives is to represent the key actors (people) and objects. We will develop descriptions of the video that reveal which objects and people interact with one another, where an interaction may or may not entail a literal physical connection. We will investigate methods to predict which objects are involved in an interaction using cues such as estimated gaze and body pose. Such predictions should form a useful prior as to where the system should pay attention in the recognition task. In addition, we will investigate methods for object-level video segmentation, such that we can identify space-time volumes likely to contain full objects, even if those objects are unfamiliar to the system.
• Connecting instantaneous snapshots and video: Finally, we will examine the interplay between static snapshots and video sequences for activity understanding. We will explore how unlabeled video can serve as a useful prior to understand poses and activities in static photos. In addition, we will examine the question of when a static snapshot is “enough” to understand an activity.
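As a rough illustration of the visual narratives direction, the sketch below contrasts a symmetric co-occurrence count with a directed "leads to" count built from temporally ordered detections. The abstract does not specify an algorithm, so everything here is an assumption made for illustration: the function names `influence_counts` and `asymmetry`, the `window` parameter, the asymmetry formula, and the toy action labels.

```python
from collections import Counter

def influence_counts(sequences, window=2):
    """Count symmetric co-occurrences and directed precedence (hypothetical sketch).

    sequences: list of lists of detected object/action labels, in temporal order.
    window: how many later detections count as "near" the current one (assumed).
    Returns (cooccur, leads), where cooccur is keyed by unordered pairs and
    leads[(a, b)] is the number of times a appeared shortly before b.
    """
    cooccur = Counter()
    leads = Counter()
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1 : i + 1 + window]:
                if a == b:
                    continue
                cooccur[frozenset((a, b))] += 1  # order is discarded
                leads[(a, b)] += 1               # order is kept: a precedes b
    return cooccur, leads

def asymmetry(leads, a, b):
    """Score in [-1, 1]: positive if a tends to precede b, negative if b precedes a."""
    fwd, rev = leads[(a, b)], leads[(b, a)]
    total = fwd + rev
    return 0.0 if total == 0 else (fwd - rev) / total

# Toy detections (invented labels, not data from the project).
toy_sequences = [
    ["open_fridge", "take_milk", "pour_milk", "drink"],
    ["open_fridge", "take_milk", "drink"],
    ["take_milk", "open_fridge", "pour_milk"],
]
cooccur, leads = influence_counts(toy_sequences, window=2)
fridge_to_milk = asymmetry(leads, "open_fridge", "take_milk")
milk_to_drink = asymmetry(leads, "take_milk", "drink")
```

In this toy data the co-occurrence count treats "open_fridge" and "take_milk" as one undirected pair, while the directed counts show that one ordering is more common than the other. A directed graph whose edges carry such scores is one simple way to picture "the extent to which one entity leads to another", though a real system would have to cope with noisy detections and much larger vocabularies.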
Document Details
- Document Type: DoD Grant Award
- Publication Date: Aug 12, 2016
- Source ID: N000141512291
Entities
People
- Kristen Grauman
Organizations
- Office of Naval Research
- United States Navy
- University of Texas at Austin