Multi-view scene representations and their computational architectures

Abstract

Visual representations are functions of (past) images of a scene (the training set) that are “useful” for answering questions about it once future images (the test set) become available. The project casts the design and analysis of visual representations in terms of basic principles of statistical decision theory and information theory, where “useful” is measured by how much a representation reduces the uncertainty of questions about the scene. “Questions” can pertain to a decision or classification task (detection, localization, recognition, categorization of objects) or to a control task, where the “answer” belongs to a continuum (in which direction to move?). Such basic principles include sufficiency, minimality, invariance, and completeness. Analyzing existing schemes within such a framework would enable comparing them beyond the mere empirical test of a given algorithm on a given dataset. This would allow the engineer to understand and predict performance, by highlighting the conditions under which a given algorithm is expected to perform to specification. Furthermore, it would allow not just comparing, but improving existing schemes, by instantiating better approximations of ideal representations. Preliminary evidence indicates that simple changes to popular existing descriptors such as SIFT, suggested by the analysis, yield significant performance improvements. The theory suggests that the size of the domain of “receptive fields” should be decoupled from the “scale” of the descriptor, which at first sight seems to run counter to the teaching of harmonic analysis and scale-space theory. Such theory, however, was developed in support of compression and storage tasks, whereas the task in vision is not to reproduce the source signal, but to make decisions based on it, where the data has been corrupted by intrinsically non-linear nuisance factors such as occlusions.
The project intends to extend “domain-size pooling” to other forms of descriptors, from deformable parts models to convolutional neural networks. The proposal also formulates conjectures on the properties of such networks that, if validated, could help explain their recent empirical success, as well as highlight potential limitations. The work plan is articulated into 11 tasks that range from the purely analytical (making some of the formal arguments articulated in the proposal rigorous) to the applied (making the implementation of DSPSIFT efficient and publicly available). The scientific significance stems from placing the task of designing or learning representations into established analytical frameworks, which enables both understanding commonly practiced methods in relation to each other, and improving them by better approximating the optimal representation, which is intractable in most cases. The proposal operates under the assumptions of the Lambert-Ambient (LA) model, which stipulates that most of the scene can be approximated as Lambertian, and that surfaces are piecewise smooth and multiply connected. While simplistic, this model is far more complex than that implicitly assumed by most existing descriptors, as pointed out in the proposal. It is also the simplest model that captures the phenomenology of image formation, including occlusions and scaling phenomena. The proposal builds on results of a prior ONR effort on the detection of “detachable objects,” defined as subsets of the image domain that back-project onto portions of the scene that are partially surrounded by the medium. As articulated in the proposal, occlusions force the representation to be the union of local regions, and detachable objects inform how to re-assemble objects from such local regions.
It is expected that the integration of this effort with detachable objects will come to fruition towards the end of the project in a natural manner, leading to better models of the scene for the purpose of visual decision and control tasks.
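
As a rough illustration of the “domain-size pooling” idea mentioned above, the following minimal sketch averages gradient-orientation histograms computed on concentric square domains of several sizes around one point. It is not the project's DSPSIFT implementation: the function names, the 4x4x8 histogram layout, the particular domain sizes, and the per-area normalization are all assumptions made here for illustration only.

```python
import numpy as np


def orientation_histogram(image, center, half_size, n_cells=4, n_bins=8):
    """Gradient-orientation histogram on a square domain of side 2*half_size.

    Hypothetical helper: the domain is always split into n_cells x n_cells
    spatial cells, whatever its size, so histograms from different domain
    sizes have the same shape and can be averaged.
    """
    cy, cx = center
    height, width = image.shape
    if (cy - half_size < 0 or cx - half_size < 0
            or cy + half_size > height or cx + half_size > width):
        raise ValueError("domain does not fit inside the image")
    patch = image[cy - half_size:cy + half_size,
                  cx - half_size:cx + half_size].astype(float)
    gy, gx = np.gradient(patch)              # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    ori_bin = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    side = patch.shape[0]
    cell = np.arange(side) * n_cells // side  # pixel index -> spatial cell index
    hist = np.zeros((n_cells, n_cells, n_bins))
    # Accumulate gradient magnitude into (row cell, column cell, orientation bin).
    np.add.at(hist, (cell[:, None], cell[None, :], ori_bin), magnitude)
    # Assumption: divide by the domain area so each size contributes comparably.
    return hist.ravel() / (side * side)


def dsp_descriptor(image, center, half_sizes=(8, 12, 16), **kwargs):
    """Average the histograms over several domain sizes, then L2-normalize."""
    hists = [orientation_histogram(image, center, h, **kwargs) for h in half_sizes]
    pooled = np.mean(hists, axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_image = rng.random((64, 64))        # synthetic stand-in for a real image
    descriptor = dsp_descriptor(demo_image, (32, 32))
    print(descriptor.shape)
```

The point of the sketch is that the descriptor's dimension and spatial layout stay fixed while the size of the domain being pooled varies, which is one way to read the decoupling of domain size from descriptor “scale” described in the abstract.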

Document Details

Document Type: DoD Grant Award
Publication Date: Aug 12, 2016
Source ID: N000141512261

Entities

People

Stefano Soatto

Organizations

Office of Naval Research
United States Navy
University of California, Los Angeles
