Grounding Vision-Language Interactions in World Models by Integrating Large Neural Models with Probabilistic Programs

Abstract

We propose a world model framework that jointly represents vision and language in terms of objects, events, and agents, and supports reasoning in three core domains: visual common sense, intuitive physics, and agency. The world model will be represented in a probabilistic programming language, which can express concepts symbolically in a highly structured and hierarchical form. Vision and language will interact with the world model primarily by conditioning its current state. We will adopt a neuro-symbolic approach, training foundation models to translate raw visual and linguistic inputs into this program representation. Conditioned on the current state, the world model can answer queries by simulating possible outcomes and drawing inferences from them. The system will contain a graphics engine, a physics engine, and a Theory of Mind engine to support visual rendering, physical dynamic events, and planning of agent actions, respectively. Overall, the neural components of the world model will allow it to handle highly variable and continuous visual and linguistic inputs, and the symbolic components will complement them by making the reasoning process transparent and coherent. We will test the performance of our system by developing datasets and game environments that simulate real-world scenarios.
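The condition-then-simulate loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the award: the one-dimensional "ball on a table" world, the names sample_world, simulate, and query_falls_off, the two hand-written predicates standing in for the output of the neural vision and language translators, and the use of plain rejection sampling in place of a real probabilistic programming language.

```python
import random


def sample_world(rng):
    """Prior over latent state for a hypothetical 1-D 'ball on a table' world."""
    return {
        "position": rng.uniform(0.0, 1.0),  # metres from the left table edge
        "velocity": rng.uniform(0.0, 2.0),  # m/s, moving toward the right edge
        "friction": rng.uniform(0.5, 2.0),  # constant deceleration, m/s^2
    }


def simulate(world):
    """Toy stand-in for a physics engine: where the ball comes to rest."""
    stop_distance = world["velocity"] ** 2 / (2.0 * world["friction"])
    return world["position"] + stop_distance


def query_falls_off(conditions, table_length=1.0, n_samples=20000, seed=0):
    """Estimate P(ball leaves the table | conditions) by rejection sampling."""
    rng = random.Random(seed)
    worlds = (sample_world(rng) for _ in range(n_samples))
    # Keep only the sampled worlds consistent with every observation.
    accepted = [w for w in worlds if all(cond(w) for cond in conditions)]
    if not accepted:
        raise ValueError("no sampled world satisfied the conditions")
    # Simulate each surviving world forward and count the outcomes.
    falls = sum(simulate(w) > table_length for w in accepted)
    return falls / len(accepted)


# Hypothetical conditions, as if produced by the neural translators.
def seen_past_midpoint(world):
    # "Vision": the ball is observed on the right half of the table.
    return world["position"] > 0.5


def pushed_hard(world):
    # "Language": "the ball was pushed hard".
    return world["velocity"] > 1.0


p_prior = query_falls_off([])
p_conditioned = query_falls_off([seen_past_midpoint, pushed_hard])
print(f"P(falls off) with no observations: {p_prior:.2f}")
print(f"P(falls off) given vision and language: {p_conditioned:.2f}")
```

In this sketch, the visual and linguistic observations only narrow the set of plausible world states; the answer to the query comes from running the simulator forward on those states, which is the division of labor the abstract describes between neural and symbolic components.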

Document Details

Document Type
DoD Grant Award
Publication Date
Apr 12, 2023
Source ID
N000142312355

Entities

People

  • Joshua B. Tenenbaum

Organizations

  • Massachusetts Institute of Technology
  • Office of Naval Research
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computational Linguistics
  • Computer Vision

Technology Areas

  • AI & ML