YIP Visual Question Answering
Abstract
Problem: As a first concrete step towards this goal, and as the next grand challenge in semantic scene understanding, the PI proposes to address the problem of Visual Question Answering (VQA). Given an image and a free-form, natural language question about the image (e.g., "What kind of store is this?", "How many people are waiting in the queue?"), the task is to automatically produce a concise, accurate, free-form, natural language answer ("bakery", "5").
Approach: Answering any possible question about an image is one of the "holy grails" of semantic scene understanding. The main thesis of this research program is that VQA represents not a single narrowly-defined problem (e.g., image classification) but rather a rich spectrum of semantic scene understanding problems and associated research directions. Each question in VQA may lie at a different point on this spectrum: from questions that directly map to existing well-studied computer-vision problems ("What is this room called?" = indoor scene recognition) all the way to questions that require an integrated approach of vision (scene), language (semantics), and reasoning (understanding) over a knowledge base ("Does the pizza in the back row next to the bottle of Coke seem vegetarian?"). Consequently, proposed work will map to a sequence of waypoints along this spectrum. Motivated by addressing VQA from a variety of perspectives, this research program will generate new datasets, knowledge, and techniques in (i) pure computer vision, (ii) integrating vision + language, (iii) integrating vision + language + common sense, (iv) interpretable models, and (v) combining a portfolio of methods. In addition, novel contributions will be made to (a) training the machine to be curious and actively ask questions to learn, (b) using VQA as a modality to learn more about the visual world than what existing annotation modalities allow, and (c) training the machine to know what it knows and what it does not. Deep neural networks will form key building blocks in the proposed approaches.
Impact: VQA is directly applicable to a variety of applications of high societal impact that involve humans eliciting situationally-relevant information from visual data, where humans and machines must collaborate to extract information from pictures. Examples include aiding visually impaired users in understanding their surroundings ("What temperature is this oven set to?"), analysts in making decisions based on large quantities of surveillance data ("What kind of car did the man in the red shirt drive away in?"), and interacting with a robot ("Is my laptop in my bedroom upstairs?"). This proposal has the potential to fundamentally improve the way visually-impaired users live their daily lives, and revolutionize how society at large interacts with visual data.
DoD relevance: Proposed work is relevant to Autonomy and Unmanned Systems, one of the nine focus areas identified in the Naval Science and Technology (S&T) Strategic Plan. In particular, it is relevant to Scene/Image Understanding towards Perception and Intelligent Decision Making. Moreover, the Office of Naval Research (ONR) Intelligent and Autonomous Systems program as well as the Image Analysis and Understanding program list interest in building sophisticated visual knowledge bases and developing methods for reasoning with images.
Document Details
- Document Type: DoD Grant Award
- Publication Date: Feb 03, 2017
- Source ID: N000141712199
Entities
People
- Devi Parikh
Organizations
- Georgia Tech Research Corporation
- Office of Naval Research
- United States Navy