YIP: Embodied Scene Understanding from Data-driven Simulation and Vision-Language Models
Abstract
Embodied scene understanding enables autonomous systems to sense their surroundings and reason about their actions when navigating the real world. Scene categorization and semantic segmentation in computer vision operate only at the image level. Embodied scene understanding goes further: it aims to extract from images the spatial and temporal information an autonomous system needs for situational awareness and decision-making.

This project aims to bring embodied scene understanding to autonomous agents by training vision-language models (VLMs) on two sources of data. The first is offline structured data drawn from massive collections of in-the-wild scene videos. The second is online interaction data from scene simulation. The resulting VLM-powered agent will achieve spatiotemporal situational awareness and counterfactual reasoning capability in the physical world. The project has three innovative research thrusts:

1. We will develop a GPT-assisted data curation pipeline to collect comprehensive scene representations from in-the-wild videos and images.
2. We will learn to generate diverse, realistic, interactive scene environments by incorporating these scene representations into a physical simulator.
3. We will design instruction tuning and closed-loop training techniques that let VLMs learn from both the offline structured data and the online interactions with the scene simulation, improving their spatiotemporal situational awareness and decision-making.

By combining insights from real-world data with the flexibility of simulated environments, our approach aims to equip autonomous agents with robust spatiotemporal situational awareness and enable them to perform counterfactual reasoning and make informed decisions in dynamic real-world settings. This research has the potential to significantly advance autonomous systems in real-world applications, from unmanned vehicles to assistive robots, across a range of DoD applications.

Approved for Public Release
Document Details
- Document Type: DoD Grant Award
- Publication Date: Mar 12, 2025
- Source ID: N000142512166
Entities
People
- Bolei Zhou
Organizations
- Office of Naval Research
- United States Navy
- University of California, Los Angeles