NOVA - A Neuro-Symbolic Vision-Language Framework for Multimodal Human-Machine Interactions

Abstract

When humans perform a task, they make sense of the world through visual and language descriptions, conduct logical and commonsense reasoning, understand and execute instructions, and proactively request clarification when they are uncertain. Inspired by this, we propose NoVa, a Neuro-symbolic Vision-language (VL) framework that enables an AI agent to recognize new concepts and perform complex tasks reliably through multimodal interactions with humans. NoVa integrates three innovations: 1) object-level VL grounding that makes sense of perceptual visual inputs at different semantic levels, from low-level object attributes to high-level properties such as affordance and functionality; 2) neuro-symbolic learning that combines neural networks and symbolic rules to conduct probabilistic reasoning during the training and inference phases; and 3) uncertainty estimation to measure the confidence of the model's predictions. The final system integrates all of these components and enables an AI agent to learn from few-shot examples and associated language descriptions. The developed techniques will be tested on various real-world tasks involving rare objects and new concepts, including object detection, multimodal information extraction, and visual question answering. If successful, NoVa will make a transformative impact in reducing data collection efforts for training AI agents. This is especially important for DoD applications, where new objects, scenes, and targets are frequently encountered. In such cases, acquiring large-scale annotated data with consistent labels is tedious and often impossible. In contrast, NoVa will enable a human-machine teaming framework in which domain experts such as naval officers can directly teach AI agents through examples, instructions, and natural-language descriptions of objects, events, and semantic roles.
The capability can be used as a tool for pervasive, wide-area visual surveillance for military bases and real-time intelligence information collection to facilitate command decision-making. Since human-assisted monitoring is costly and often unreliable, systems that can reduce human involvement and improve reliability are of interest.

Approved for Public Release

Document Details

Document Type: DoD Grant Award
Publication Date: Aug 11, 2023
Source ID: N000142312780

Entities

People

Kai Wei Chang

Organizations

Office of Naval Research
United States Navy
University of California, Los Angeles
