YIP Visual Question Answering

Abstract

Problem: As a first concrete step towards this goal, and as the next grand challenge in semantic scene understanding, the PI proposes to address the problem of Visual Question Answering (VQA). Given an image and a free-form, natural language question about the image (e.g., "What kind of store is this?", "How many people are waiting in the queue?"), the task is to automatically produce a concise, accurate, free-form, natural language answer ("bakery", "5").

Approach: Answering any possible question about an image is one of the "holy grails" of semantic scene understanding. The main thesis of this research program is that VQA represents not a single narrowly-defined problem (e.g., image classification) but rather a rich spectrum of semantic scene understanding problems and associated research directions. Each question in VQA may lie at a different point on this spectrum: from questions that directly map to existing well-studied computer-vision problems ("What is this room called?" = indoor scene recognition) all the way to questions that require an integrated approach of vision (scene), language (semantics), and reasoning (understanding) over a knowledge base ("Does the pizza in the back row next to the bottle of Coke seem vegetarian?"). Consequently, proposed work will map to a sequence of waypoints along this spectrum. Motivated by addressing VQA from a variety of perspectives, this research program will generate new datasets, knowledge, and techniques in (i) pure computer vision, (ii) integrating vision + language, (iii) integrating vision + language + common sense, (iv) interpretable models, and (v) combining a portfolio of methods. In addition, novel contributions will be made to (a) training the machine to be curious and actively ask questions to learn, (b) using VQA as a modality to learn more about the visual world than what existing annotation modalities allow, and (c) training the machine to know what it knows and what it does not. Deep neural networks will form key building blocks in the proposed approaches.

Impact: VQA is directly applicable to a variety of applications of high societal impact that involve humans eliciting situationally-relevant information from visual data, where humans and machines must collaborate to extract information from pictures. Examples include aiding visually impaired users in understanding their surroundings ("What temperature is this oven set to?"), analysts in making decisions based on large quantities of surveillance data ("What kind of car did the man in the red shirt drive away in?"), and interacting with a robot ("Is my laptop in my bedroom upstairs?"). This proposal has the potential to fundamentally improve the way visually-impaired users live their daily lives, and revolutionize how society at large interacts with visual data.

DoD relevance: Proposed work is relevant to Autonomy and Unmanned Systems, one of the nine focus areas identified in the Naval Science and Technology (S&T) Strategic Plan. In particular, it is relevant to Scene/Image Understanding towards Perception and Intelligent Decision Making. Moreover, the Office of Naval Research (ONR) Intelligent and Autonomous Systems program as well as the Image Analysis and Understanding program list interest in building sophisticated visual knowledge bases and developing methods for reasoning with images.

Document Details

Document Type: DoD Grant Award
Publication Date: Feb 03, 2017
Source ID: N000141712199

Entities

People

Devi Parikh

Organizations

Georgia Tech Research Corporation
Office of Naval Research
United States Navy
