Grounding Vision-Language Interactions in World Models by Integrating Large Neural Models with Probabilistic Programs
Abstract
We propose a world model framework that jointly represents vision and language in terms of objects, events and agents, and supports reasoning in three core domains: visual common sense, intuitive physics, and agency. The world model will be represented in a probabilistic programming language that can symbolically express concepts in a highly structured and hierarchical form. Vision and language will interact with the world model primarily by conditioning its current state. We will adopt a neuro-symbolic approach, training foundation models to translate raw visual and linguistic inputs into this program representation. Conditioned on the current state, the world model can answer queries by simulating possible outcomes and drawing inferences from them. The system will contain a graphics engine, a physics engine and a Theory of Mind engine to support visual rendering, physical dynamic events and planning of agent actions, respectively. Overall, the neural components of the world model will allow for highly variable and continuous visual and linguistic inputs, and the symbolic components will complement them by making the reasoning process transparent and coherent. We will test the performance of our system by developing datasets and game environments that simulate real-world scenarios.
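The condition-then-simulate loop described in the abstract can be illustrated with a toy sketch. This is not the project's system: it assumes a one-dimensional scene (a ball sliding on a table), replaces the physics engine with a single friction formula, and uses plain rejection sampling in place of a probabilistic programming language. All names and constants (`TABLE_LENGTH`, `FRICTION`, `sample_world`, `consistent`, `prob_falls_off`) are hypothetical, and the "vision" and "language" inputs are stand-ins for what neural front-ends might output.

```python
import random

# Hypothetical scene constants for the toy example.
TABLE_LENGTH = 2.0   # metres
FRICTION = 0.3       # sliding friction coefficient
G = 9.81             # gravitational acceleration, m/s^2


def sample_world(rng):
    """Prior over the latent world state: ball position and push speed."""
    return {
        "x": rng.uniform(0.0, TABLE_LENGTH),
        "speed": rng.uniform(0.0, 4.0),
    }


def simulate(world):
    """Toy stand-in for a physics engine: slide until friction stops the ball."""
    stop_distance = world["speed"] ** 2 / (2 * FRICTION * G)
    return world["x"] + stop_distance


def consistent(world, observed_x, pushed_hard):
    """Conditioning step: keep only worlds that match both inputs.

    observed_x  -- assumed output of a vision model (noisy ball position)
    pushed_hard -- assumed parse of an utterance such as "it was pushed hard"
    """
    vision_ok = abs(world["x"] - observed_x) < 0.1
    language_ok = (world["speed"] > 2.0) == pushed_hard
    return vision_ok and language_ok


def prob_falls_off(observed_x, pushed_hard, n=20000, seed=0):
    """Answer the query "will the ball fall off?" by rejection sampling."""
    rng = random.Random(seed)
    accepted = 0
    falls = 0
    for _ in range(n):
        world = sample_world(rng)
        if not consistent(world, observed_x, pushed_hard):
            continue  # reject worlds that contradict vision or language
        accepted += 1
        falls += simulate(world) > TABLE_LENGTH
    return falls / accepted if accepted else float("nan")


if __name__ == "__main__":
    print("pushed hard:  ", prob_falls_off(1.5, True))
    print("pushed gently:", prob_falls_off(1.5, False))
```

Rejection sampling is used here only because it makes the conditioning step explicit; a real probabilistic programming system would typically rely on more efficient inference.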
Document Details
- Document Type: DoD Grant Award
- Publication Date: Apr 12, 2023
- Source ID: N000142312355
Entities
People
- Joshua B. Tenenbaum
Organizations
- Massachusetts Institute of Technology
- Office of Naval Research
- United States Navy