Using Large Language Models as World Models in Visual Environments

Abstract

Model-based reinforcement learning (RL) aims to reduce the need for costly environment interaction by using a world model to simulate those interactions. However, current vision-based world models often produce inaccurate trajectories, which makes them less reliable for planning and simulation. To address this problem, we propose a world model grounded in explicit textual representations. Our method transforms visual states into tokenized textual representations with explicit semantic meaning, and uses large language models (LLMs) to predict the next state in the same textual form. Our preliminary experimental results indicate that the proposed text-grounded world model imagines trajectories accurately, which in turn improves policy training.
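The loop described in the abstract (encode a state as text, ask a predictor for the next state as text, repeat) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the award record: the grid-of-labels state format, the prompt layout, and the names state_to_text, build_prompt, stub_predictor, and imagine are hypothetical. The step that converts pixels to symbols is not shown, and a small rule-based function stands in for the LLM so the example runs without a model.

```python
from typing import Callable, List

Grid = List[List[str]]  # hypothetical symbolic state: one semantic label per cell

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def state_to_text(grid: Grid) -> str:
    """Serialize the symbolic state as one line of space-separated labels per row."""
    return "\n".join(" ".join(row) for row in grid)


def text_to_state(text: str) -> Grid:
    """Parse the textual state back into a grid of labels."""
    return [line.split() for line in text.splitlines()]


def build_prompt(state_text: str, action: str) -> str:
    """Assumed prompt layout; the source does not specify one."""
    return f"State:\n{state_text}\nAction: {action}\nNext state:\n"


def stub_predictor(prompt: str) -> str:
    """Stand-in for an LLM call: applies simple grid-movement rules to the prompt."""
    body, action_part = prompt.split("\nAction: ")
    grid = text_to_state(body[len("State:\n"):])
    action = action_part.split("\n")[0]
    rows, cols = len(grid), len(grid[0])
    r, c = next((i, j) for i in range(rows) for j in range(cols)
                if grid[i][j] == "agent")
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    # Move only if the target cell is inside the grid and is not a wall.
    if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "wall":
        grid[r][c], grid[nr][nc] = "empty", "agent"
    return state_to_text(grid)


def imagine(grid: Grid, actions: List[str],
            predictor: Callable[[str], str]) -> List[Grid]:
    """Roll out an imagined trajectory entirely in text space."""
    trajectory = [grid]
    text = state_to_text(grid)
    for action in actions:
        text = predictor(build_prompt(text, action))
        trajectory.append(text_to_state(text))
    return trajectory


start = [["agent", "empty", "wall"],
         ["empty", "empty", "goal"]]
rollout = imagine(start, ["right", "right", "down"], stub_predictor)
print(state_to_text(rollout[-1]))
```

In a real system the predictor argument would wrap a call to a language model, and the imagined trajectory would supply training data for the policy; because each state stays human-readable, prediction errors can be inspected directly.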

Document Details

Document Type
DoD Grant Award
Publication Date
Feb 06, 2025
Source ID
FA23862514013

Entities

People

  • Hyun Oh Song

Organizations

  • Air Force Office of Scientific Research
  • Seoul National University
  • United States Air Force

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Computational Linguistics
  • Computer Vision

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference