YIP Embodied Scene Understanding from Data-driven Simulation and Vision-Language Models

Abstract

Embodied scene understanding enables autonomous systems to sense their surroundings and reason about their actions when navigating the real world. Unlike scene categorization and semantic segmentation in computer vision, where understanding happens only at the image level, embodied scene understanding aims to extract from images the spatial and temporal information essential for an autonomous system's situational awareness and decision-making. This project aims to bring embodied scene understanding capability to autonomous agents by training vision-language models (VLMs) with offline structured data from massive in-the-wild scene videos and online interaction data from scene simulation. The resulting VLM-powered agent will achieve spatiotemporal situational awareness and counterfactual reasoning capability in the physical world. This project has three innovative research thrusts: (1) We will develop a GPT-assisted data curation pipeline to collect comprehensive scene representations from in-the-wild videos and images. (2) We will learn to generate diverse, realistic, interactive scene environments by incorporating the scene representations into a physical simulator. (3) We will design instruction tuning and closed-loop training techniques that enable VLMs to learn from the offline structured data and from online interactions with the scene simulation, improving their spatiotemporal situational awareness and decision-making. By combining insights from real-world data with the flexibility of simulated environments, our approach aims to equip autonomous agents with robust spatiotemporal situational awareness and enable them to perform counterfactual reasoning and make informed decisions in dynamic real-world settings. This research has the potential to significantly advance autonomous systems in real-world applications, from unmanned vehicles to assistive robots in various DoD applications.

Approved for Public Release

Document Details

Document Type: DoD Grant Award
Publication Date: Mar 12, 2025
Source ID: N000142512166

Entities

People

Bolei Zhou

Organizations

Office of Naval Research
United States Navy
University of California, Los Angeles
