Grounding Vision-Language Interactions in World Models by Integrating Large Neural Models with Probabilistic Programs

Abstract

We propose a world model framework that jointly represents vision and language in terms of objects, events, and agents, and supports reasoning in three core domains: visual common sense, intuitive physics, and agency. The world model will be represented in a probabilistic programming language, which can symbolically express concepts in a highly structured and hierarchical form. Vision and language will interact with the world model primarily by conditioning its current state. We will adopt a neuro-symbolic approach, training foundation models to translate raw visual and linguistic inputs into this program representation. Conditioned on the current state, the world model can answer queries by simulating possible outcomes and drawing inferences from them. The system will contain a graphics engine, a physics engine, and a Theory of Mind engine to support visual rendering, physical dynamic events, and planning of agent actions, respectively. Overall, the neural components of the world model will accommodate highly variable and continuous visual and linguistic inputs, and the symbolic components will complement them by making the reasoning process transparent and coherent. We will test the performance of our system by developing datasets and game environments that simulate real-world scenarios.
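The condition-then-simulate loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not name a probabilistic programming language, so plain Python with rejection sampling stands in for one, and the scene (a ball pushed along a table of width 1.0), the function names, and the friction constant are all hypothetical. The observation dictionary plays the role of the state estimate that a vision or language model would supply.

```python
import random


def sample_world(rng):
    """Draw one hypothetical scene from the prior: ball position and push strength."""
    return {
        "x": rng.uniform(0.0, 1.0),     # ball position along a table of width 1.0
        "push": rng.uniform(0.0, 1.0),  # strength of the push toward the edge
    }


def simulate(world, friction=0.5):
    """Toy stand-in for a physics engine: the ball travels push * friction."""
    final_x = world["x"] + world["push"] * friction
    return final_x > 1.0  # True if the ball ends up past the table edge


def consistent(world, observation, tol=0.05):
    """Conditioning step: keep only worlds that match the observed position."""
    return abs(world["x"] - observation["x"]) <= tol


def query_fall_probability(observation, n=20000, seed=0):
    """Answer 'will the ball fall?' by simulating worlds that fit the observation."""
    rng = random.Random(seed)
    samples = (sample_world(rng) for _ in range(n))
    accepted = [w for w in samples if consistent(w, observation)]
    if not accepted:
        raise ValueError("no sampled world matched the observation")
    return sum(simulate(w) for w in accepted) / len(accepted)


if __name__ == "__main__":
    # A ball seen near the edge versus one seen far from it.
    print(query_fall_probability({"x": 0.9}))
    print(query_fall_probability({"x": 0.2}))
```

In this sketch, a ball observed near the edge yields a high estimated probability of falling, while one observed far from the edge cannot fall under the toy dynamics. A real system of the kind proposed would replace the hand-written prior and dynamics with the graphics, physics, and Theory of Mind engines, and the rejection step with a more efficient inference method.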

Document Details

Document Type: DoD Grant Award
Publication Date: Apr 12, 2023
Source ID: N000142312355

Entities

People

Joshua B. Tenenbaum

Organizations

Massachusetts Institute of Technology
Office of Naval Research
United States Navy
