Self-Supervised Scene Understanding With Object-Centric Dorso-Ventral Neural Networks

Abstract

The last decade's advances in deep learning, GPU computing, and big annotated datasets [...], and many practical applications in object detection and activity recognition have been fielded. However, current approaches require large amounts of training data, which is frequently not available. In the absence of such data, recognition systems do not generalize [...] hierarchical, compositional structure: scenes are composed of objects, objects have parts, and human behavior breaks down into events that have substructure.

Our approach takes inspiration from processing in the human brain, which is supported by a hierarchically organized series of anatomically distinguishable cortical areas, divided into two separate pathways: the ventral and dorsal visual streams. Significant progress has been achieved in modeling the ventral pathway by using deep convolutional neural networks trained to solve object recognition tasks. The dorsal stream is less well understood but has been associated with spatiotemporal processing and holistic scene understanding. The dorsal and ventral pathways are known to interact ubiquitously throughout the course of visual processing. We believe that creating a neural network model that brings together recognition and spatiotemporal reasoning in the same framework will be a good start to understanding this relationship, in addition to providing the next generation of technical solutions to central problems of computer vision.

We have created a multi-disciplinary team working at the intersection of AI, cognition, and neuroscience. Our approach has two themes. The first theme is going beyond the shift-invariant computational layers of CNNs to a compositional object-centric hierarchy, the hierarchical scene graph (HSG). In the HSG, a hierarchy of nodes represents object entities. Graph edges encode various relationships between entities. HSGs describe the 3D geometry of entities in real-world scenes, but their nodes are spatially registered in the 2D image from which they are derived. The HSG's key technical innovation is a suite of learnable graph propagation and aggregation operations that transform spatially uniform input (e.g., an image) into spatially nonuniform graphs. We train the HSG with self-supervised learning by using dynamic image data directly as supervision. We learn in a compositional fashion, progressively building up new knowledge in terms of knowledge already gained. Our second theme is moving beyond 2D static datasets to active interaction with 3D environments. PIs on this project have pioneered new interactive and realistic 3D simulators. We use these simulation environments to simultaneously train and test (a) HSG models, (b) human subjects for cognitive experiments, and (c) macaques for neuroscience experiments. Our model design can thus directly inform and be informed by the findings in cognitive and neuroscience experiments.

We believe this strategy will lead to robust computer vision because: 1) In HSGs, representation and learning are compositional. 2) HSGs support top-down processing, inspired by recurrent connections in the visual brain. 3) HSGs are a crude model of interactions between the dorsal and ventral streams, as 3D objects are bound to a 2D spatial frame. 4) Training with active perception in a 3D simulation environment is a direct path to building in 3D viewpoint invariance and tolerance of occlusion. 5) HSGs are end-to-end trainable, where high-level tasks propagate backward to good feature choices throughout.

We expect this approach to lead to major advances in visual scene understanding and activity recognition, with applications of interest to the DoD such as improved systems for situational awareness.
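
As a rough illustration of the HSG idea described in the abstract (nodes for entities, registered to pixels of the 2D image, built from a spatially uniform input), the following Python sketch pools a grid of per-pixel features into one node per region and then groups nodes under a parent. It is not the project's implementation: the names `Node`, `aggregate`, and `build_parent` are hypothetical, the region labels are assumed to be given, and plain mean pooling stands in for the learnable aggregation operations the abstract mentions. Only part-of edges are modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # (row, col) position in the source image


@dataclass
class Node:
    """One entity in the hierarchy (hypothetical structure, for illustration)."""
    name: str
    pixels: List[Pixel]          # 2D registration: image pixels the entity covers
    feature: List[float]         # aggregated feature vector for the entity
    children: List["Node"] = field(default_factory=list)  # part-of edges


def aggregate(features: List[List[List[float]]],
              labels: List[List[int]]) -> Dict[int, Node]:
    """Turn a uniform grid of per-pixel features into one node per label.

    Assumption: `labels` assigns every pixel to a region. Mean pooling is a
    fixed stand-in for a learnable aggregation operation.
    """
    groups: Dict[int, List[Pixel]] = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            groups.setdefault(lab, []).append((r, c))

    dim = len(features[0][0])
    nodes: Dict[int, Node] = {}
    for lab, pix in groups.items():
        mean = [sum(features[r][c][k] for r, c in pix) / len(pix)
                for k in range(dim)]
        nodes[lab] = Node(name=f"entity_{lab}", pixels=pix, feature=mean)
    return nodes


def build_parent(name: str, children: List[Node]) -> Node:
    """Compose child nodes into a parent covering the union of their pixels.

    The parent feature is the pixel-count-weighted mean of the child features.
    """
    pixels = [p for ch in children for p in ch.pixels]
    dim = len(children[0].feature)
    feature = [sum(ch.feature[k] * len(ch.pixels) for ch in children) / len(pixels)
               for k in range(dim)]
    return Node(name=name, pixels=pixels, feature=feature, children=list(children))
```

For example, a 2x2 image with one feature per pixel and two row-wise regions yields two entity nodes, which `build_parent` can then group under a single scene node. The resulting graph has as many nodes as there are entities, not pixels, which is the sense in which the representation is spatially nonuniform.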

Document Details

Document Type: DoD Grant Award
Publication Date: Sep 03, 2021
Source ID: N000142112801

Entities

People

  • Jitendra Malik

Organizations

  • Office of Naval Research
  • United States Navy
  • University of California Regents

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks