Attributed Graph Models for Scene Interpretation

Abstract

Recently, the computer vision community has focused on “scaling up” to accommodate more object categories, as well as building powerful object detectors, assembling larger datasets for training and evaluation, and attempting deeper semantic descriptions. But systems delivering rich descriptions of meaningful interactions among objects and parts of objects usually do not even approach human capabilities, and there is no evident way to measure performance and progress. In particular, the adopted metrics for sub-tasks in standard “challenges,” such as false positive and false negative rates for object detection and localization, do not apply to the richer descriptions that human beings can provide using contextual reasoning, for example deciding whether a car is “parked,” or a person is “leaving” a building, or two people are “walking and talking” together. Something new is required. Specifically, the field lacks a mathematically unified approach for designing and testing scene annotation machines. Contextual reasoning requires scene models, and scene models provide the basis for more effective and meaningful testing of annotation machines. The resulting synergies will be exploited in the proposed research. Building and testing will be synthesized in a single Bayesian framework by formulating scene interpretation as a dynamic process of “graph discovery.” The common core is a probability distribution over attributed graphs with vertices and edges annotated relative to a pre-determined semantic vocabulary. This scene model provides a priori likelihoods for alternative scene configurations. Image annotation evolves in discrete steps by assembling a graph. The link between building and testing, and the mechanism for sequential processing, is the representation of a graph as answers to binary questions about the existence and attributes of vertices and edges. What differs among tasks and design strategies is the data model.
In testing an existing vision system, the prior model serves as a “query generator” for streams of questions that are “unpredictable” in an information-theoretic sense; the data are the correct answers for the image under scrutiny, provided by a human conducting this “Turing test.” For building a new system, several alternative strategies are considered, ranging from replacing the true answers by imperfect classifiers to “compositional systems” operating at the pixel level; again, the scene model remains the same. In all cases, testing can inform construction by identifying recurring ambiguous contexts using “question histories.” Perhaps the main challenge is the context-versus-computation dilemma: achieving human-like performance requires extensive contextual reasoning, but the more context, the more computation. Emulating human capacities in the interpretation of sensory signals is a cornerstone of artificial intelligence. Although the mathematical and computational challenges of the proposed program are formidable, the dividends are potentially considerable because virtually the same framework for vision could be applied to building and testing other intelligent systems for parsing other types of sensory data.
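
To make the "query generator" idea concrete, here is a minimal Python sketch, not the method of the proposed research. It assumes a toy scene prior over four binary questions and reads "unpredictable" as "the answer has maximal entropy given the answers so far." The question list, the probabilities, and the names prior_weight, prob_yes, next_question and interrogate are all hypothetical and chosen only for illustration.

```python
import itertools
import math

# Hypothetical toy vocabulary: each binary question concerns the existence
# or an attribute of a vertex in the scene graph.
QUESTIONS = [
    "Is there a person?",
    "Is there a car?",
    "Is the person walking?",
    "Is the car parked?",
]


def prior_weight(answers):
    """Prior probability of one scene configuration (assumed toy values)."""
    person, car, walking, parked = answers
    # An attribute can only hold if the corresponding object exists.
    if (walking and not person) or (parked and not car):
        return 0.0
    w = 0.5  # P(person) = P(no person) = 0.5
    w *= 0.4 if car else 0.6
    if person:
        w *= 0.7 if walking else 0.3
    if car:
        w *= 0.9 if parked else 0.1
    return w


# All configurations with nonzero prior probability.
CONFIGS = [
    c for c in itertools.product([False, True], repeat=len(QUESTIONS))
    if prior_weight(c) > 0.0
]


def prob_yes(q, history):
    """P(answer to question q is yes | answers recorded in history)."""
    consistent = [
        c for c in CONFIGS if all(c[i] == a for i, a in history.items())
    ]
    total = sum(prior_weight(c) for c in consistent)
    yes = sum(prior_weight(c) for c in consistent if c[q])
    return yes / total


def binary_entropy(p):
    """Entropy in bits of a yes/no answer with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def next_question(history):
    """Pick the unasked question whose answer is least predictable."""
    remaining = [q for q in range(len(QUESTIONS)) if q not in history]
    return max(remaining, key=lambda q: binary_entropy(prob_yes(q, history)))


def interrogate(truth):
    """Ask every question in greedy max-entropy order; truth plays the human."""
    history, order = {}, []
    while len(history) < len(QUESTIONS):
        q = next_question(history)
        history[q] = truth[q]  # the correct answer for the image
        order.append(q)
    return order
```

In this sketch the dictionary history plays the role of a question history: each answer conditions the prior, so a question whose answer is already implied (for example, asking whether a person is walking after learning that no person is present) has zero entropy and is deferred. A real scene model would be a distribution over attributed graphs rather than an enumerated table, and exhaustive enumeration as done here would not scale.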

Document Details

Document Type
DoD Grant Award
Publication Date
Aug 12, 2016
Source ID
N000141512267

Entities

People

  • Donald Geman

Organizations

  • Johns Hopkins University
  • Office of Naval Research
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Software Engineering
  • Theoretical Analysis

Technology Areas

  • AI & ML