Using Large Language Models as World Models in Visual Environments
Abstract
Model-based reinforcement learning (RL) aims to reduce the need for costly environment interaction by using a world model to simulate interactions. However, current vision-based world models often produce inaccurate trajectories, which makes them less reliable for planning and simulation. To address this problem, we propose a world model grounded in explicit textual representations. Our method transforms visual states into tokenized textual representations with explicit semantic meaning and uses large language models (LLMs) to predict the next state in that textual form. Our preliminary experimental results demonstrate that the proposed text-grounded world model imagines trajectories accurately, enabling improved policy training.
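The loop described in the abstract (visual state to text, then text-to-text next-state prediction, repeated to imagine a trajectory) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the project: the grid encoding, the "object at (row, col)" text format, the prompt layout, and the names `state_to_text`, `build_prompt`, `toy_predictor`, and `imagine` are all hypothetical. In particular, `toy_predictor` is a rule-based stand-in occupying the slot where an LLM call would go; it is not an LLM.

```python
import re
from typing import Callable, List

# Hypothetical symbolic encoding of a visual state (assumption, not from the source).
NAMES = {"A": "agent", "G": "goal", "#": "wall"}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
FACT = re.compile(r"(\w+) at \((-?\d+), (-?\d+)\)")


def state_to_text(grid: List[List[str]]) -> str:
    """Serialize a grid into one 'object at (row, col)' fact per line."""
    facts = [
        f"{NAMES[cell]} at ({r}, {c})"
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell in NAMES
    ]
    return "\n".join(facts)


def build_prompt(state_text: str, action: str) -> str:
    """Format the textual state and action as a next-state prediction prompt."""
    return f"State:\n{state_text}\nAction: {action}\nNext state:\n"


def toy_predictor(prompt: str) -> str:
    """Rule-based stand-in for an LLM: moves the agent unless a wall blocks it."""
    state_text = prompt.split("State:\n", 1)[1].split("\nAction: ", 1)[0]
    action = prompt.split("\nAction: ", 1)[1].split("\n", 1)[0]
    dr, dc = MOVES[action]
    parsed = [(m.group(1), int(m.group(2)), int(m.group(3)))
              for m in FACT.finditer(state_text)]
    walls = {(r, c) for name, r, c in parsed if name == "wall"}
    out = []
    for name, r, c in parsed:
        if name == "agent" and (r + dr, c + dc) not in walls:
            r, c = r + dr, c + dc
        out.append(f"{name} at ({r}, {c})")
    return "\n".join(out)


def imagine(state_text: str, actions: List[str],
            predictor: Callable[[str], str]) -> List[str]:
    """Roll the world model forward entirely in text, one action at a time."""
    trajectory = [state_text]
    for action in actions:
        state_text = predictor(build_prompt(state_text, action))
        trajectory.append(state_text)
    return trajectory


if __name__ == "__main__":
    start = state_to_text([["A", ".", "G"]])
    for step in imagine(start, ["right", "right"], toy_predictor):
        print(step, end="\n---\n")
```

Because `imagine` takes the predictor as a callable, the stand-in could be swapped for a function that sends the same prompt to a language model and returns its completion, leaving the rollout logic unchanged.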
Document Details
- Document Type: DoD Grant Award
- Publication Date: Feb 06, 2025
- Source ID: FA23862514013
Entities
People
- Hyun Oh Song
Organizations
- Air Force Office of Scientific Research
- Seoul National University
- United States Air Force