YIP: Embodied Scene Understanding from Data-driven Simulation and Vision-Language Models

Abstract

Embodied scene understanding enables autonomous systems to sense their surroundings and reason about their actions when navigating the real world. Unlike scene categorization and semantic segmentation in computer vision, where understanding happens only at the image level, embodied scene understanding aims to extract from images the spatial and temporal information essential for the autonomous system's situational awareness and decision-making. This project aims to bring embodied scene understanding capability to autonomous agents by training vision-language models (VLMs) with offline structured data from massive in-the-wild scene videos and online interaction data from scene simulation. The resulting VLM-powered agent will achieve spatiotemporal situational awareness and counterfactual reasoning capability in the physical world. This project has three innovative research thrusts: (1) We will develop a GPT-assisted data curation pipeline to collect comprehensive scene representations from in-the-wild videos and images. (2) We will learn to generate diverse, realistic, interactive scene environments by integrating the scene representations with a physical simulator. (3) We will design instruction-tuning and closed-loop training techniques that enable VLMs to learn from the offline structured data and from online interactions with the scene simulation, improving their spatiotemporal situational awareness and decision-making. By combining insights from real-world data with the flexibility of simulated environments, our approach aims to equip autonomous agents with robust spatiotemporal situational awareness and enable them to perform counterfactual reasoning and make informed decisions in dynamic real-world settings. This research has the potential to significantly advance autonomous systems in real-world applications, from unmanned vehicles to assistive robots in various DoD applications.

Approved for Public Release

Document Details

Document Type
DoD Grant Award
Publication Date
Mar 12, 2025
Source ID
N000142512166

Entities

People

  • Bolei Zhou

Organizations

  • Office of Naval Research
  • United States Navy
  • University of California, Los Angeles

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computer Vision
  • Enterprise Information Systems Architecture and Joint Command Capability Interoperability Support

Technology Areas

  • AI & ML
  • Autonomy
  • Autonomy - Autonomous System Control
  • Autonomy - Human-Robot Interaction