Visual Reasoning via Spatio-temporal Scene Graphs

Abstract

Problem Description Computer vision has a daunting task of structuring and bringing meaning to the hundreds of billions of images and videos generated each year. Whether the challenge is to recognize and contextualize the contents of billions of images, to search for more hours of videos than humans can possibly watch, or to task personal robots to work with humans, advanced visual intelligence lies at the core of these technologies. As a step towards making progress towards on these applications, scene graphs enabled computer visionto move away from treating images as a collection of discrete objects or attributes and to contextualize them as a graph of relationships among interconnected objects. Scene graphs enabled progress in image captioning, image retrieval, visual question answering, relationship modeling and image generation. Unfortunately, scene graph enabled progress has been stymied by their (1) limitation to images, (2) isolation from knowledge bases, and (3) static number of annotations. To overcome the three limitations of scene graphs mentioned above, we believe the following three key elements are required: (1) a structured graphical representation of 3D space and videos, (2) a distillation ofvisual concepts into knowledge bases and incorporation of external knowledge bases, and (3) the capability for agents to self-supervise their knowledge and grow organically. Research Goals Our research is geared towards enhancing computer vision by removing the limitations of scene graphs that were mentioned above. In particular, we propose to conduct research consisting of 3 thrusts:1. expand scene graphs into the 3D space and temporal video space;2. distilling knowledge from scene graphs and incorporating external knowledge;3. build self-supervised agent that can grow its knowledge.Expected Impact The proposed research expands the capabilities of computer vision when tackling knowledge based high-level visual tasks. Our research will: (1) develop and launch two new knowledge-based structured vision datasets, which will capture common sense visual knowledge and rich data of complex interactions between elements 3D spaces and videos; (2) advance the principles and technologies for large scale distillation of visual knowledge, knowledge-base construction and knowledge-based visual inference; (3) provide a system that can continuously learn to expand its own knowledge about the world in a self supervised manner.

Document Details

Document Type: DoD Grant Award
Publication Date: Aug 20, 2019
Source ID: N000141912477

Entities

People

Fei-Fei Li

Organizations

Office of Naval Research
Stanford University
United States Navy

Visual Reasoning via Spatio-temporal Scene Graphs

Abstract

Document Details

Entities

People

Organizations

Tags

Fields of Study

Readers

Technology Areas