Self-Assembling-Augmenting Multimodal AI for Generalizable Reasoning

Abstract

Current state-of-the-art neural machine learning models often encode human inductive bias through hand-designed architectures, training curricula, or data augmentation. Such manual design is especially infeasible and sub-optimal for complex multimodal tasks that require multi-step reasoning across diverse modalities such as text, images, videos, and tables. Moreover, many new, unseen task instances in real-world deployment may not use all of these modalities in the same relationship or order as in the training data, so the model should learn to self-choose different subsets of modalities and their relations, as well as the importance weight of each modality, for the unseen task at hand. The model should also be able to adapt its architecture design to the unique unseen task it faces. Furthermore, many current state-of-the-art models are static and cannot generalize to dynamic, continual streams of new incoming tasks and data without catastrophic forgetting, i.e., losing performance on previous tasks. Finally, these models also need to be robust to adversarial or noise-related perturbations, as well as to extremely low-resource data scenarios. We address generalization at all three levels (training, model, and data) via our self-mixing multi-task curriculum bandits, self-assembling modular architecture controllers, and self-data-augmentation algorithm for unseen, low-resource multimodal tasks. In thrust-1, we will develop methods for learning cross-modal representations via multi-task training on several diverse modality tasks. Next, we will develop a multi-armed-bandit-based controller for self-multitask-mixing that will automatically choose the subset of important modality tasks and self-learn their weighted, ordered mixture as a training curriculum.
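To make the thrust-1 idea of a bandit-based task-mixing controller concrete, here is a minimal sketch. It assumes an Exp3-style bandit, which the abstract does not specify. The class `Exp3TaskMixer`, the task names, and the simulated reward function are all hypothetical stand-ins for illustration. In a real system the reward would be something like the validation gain on the target task after training on a mini-batch of the chosen task.

```python
import math
import random


class Exp3TaskMixer:
    """Exp3 bandit over training tasks (hypothetical helper, not from the source)."""

    def __init__(self, tasks, gamma=0.1, seed=0):
        self.tasks = list(tasks)
        self.gamma = gamma  # exploration rate in (0, 1]
        self.log_weights = [0.0] * len(self.tasks)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over log-weights, mixed with a uniform exploration term.
        m = max(self.log_weights)
        exp_w = [math.exp(w - m) for w in self.log_weights]
        total = sum(exp_w)
        k = len(self.tasks)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in exp_w]

    def choose(self):
        # Sample the index of the next task to train on.
        probs = self.probabilities()
        return self.rng.choices(range(len(self.tasks)), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Reward is assumed to lie in [0, 1]; it is importance-weighted by
        # the probability of the arm that was actually played.
        probs = self.probabilities()
        estimate = reward / probs[arm]
        self.log_weights[arm] += self.gamma * estimate / len(self.tasks)


def simulated_reward(task, rng):
    # Assumption: stand-in for a measured validation gain on the target task.
    # The per-task success rates below are invented for illustration only.
    means = {"text_qa": 0.1, "image_captioning": 0.8, "table_qa": 0.3}
    return 1.0 if rng.random() < means[task] else 0.0


def run_curriculum(steps=3000, seed=0):
    # Returns the trained mixer and the ordered task schedule it produced.
    rng = random.Random(seed + 1)
    mixer = Exp3TaskMixer(["text_qa", "image_captioning", "table_qa"], seed=seed)
    schedule = []
    for _ in range(steps):
        arm = mixer.choose()
        task = mixer.tasks[arm]
        mixer.update(arm, simulated_reward(task, rng))
        schedule.append(task)
    return mixer, schedule


if __name__ == "__main__":
    mixer, schedule = run_curriculum()
    for task, p in zip(mixer.tasks, mixer.probabilities()):
        print(f"{task}: {p:.3f}")
```

In this sketch the final sampling probabilities play the role of the learned task mixture weights, and the recorded `schedule` is the ordered curriculum. The uniform exploration term keeps every task at a probability of at least `gamma / K`, so no modality task is dropped entirely.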
In thrust-2, we will develop a neural modular network approach that uses soft or RL-based controllers to learn how to self-assemble the layout of the different modality reasoning modules (e.g., find, transform, and compare) based on the unique unseen task at hand, instead of employing one fixed, manually designed architecture. We will also focus on continual learning paradigms in which different modality tasks arrive sequentially and the model needs to adapt continually so as to avoid catastrophic forgetting. Here, we will present both continual neural cell search and continual architecture-designing methods, where the model decides whether to reuse, adapt, or replicate existing modules. In thrust-3, we will orthogonally focus on auto-data-augmentation methods that can automatically generate training data for unseen and adversarial scenarios, either by employing a controller that searches for optimal perturbation policies (by combining multimodal sub-policies) based on its performance rewards on the target task, or by using human-generated adversaries to directly supervise the policy generator. We will extensively evaluate our self-assembling-augmenting models on diverse multimodal tasks, ranging from multimodal reasoning, multimodal navigation, and multi-hop fact verification to multimodal prediction and commonsense. We will also validate our methods on several real-world intelligence applications of interest to the DoD as well as for important community outreach.
In particular, we will pursue collaborations with UNC robotics and 3D-vision labs, DoD researchers and labs (e.g., ARO and ARL), and disability-assistance and education experts. These collaborations will build on our self-learning, generalizable, and interpretable multimodal models, which allow for robust, adaptable, and low-risk autonomous agents for remote human-robot interaction in unseen scenarios in battlefields, evacuations, and warehouses, as well as for complex multimodal reasoning in the face of low or changing resources in important community/civilian scenarios of disability assistance and education effect

Document Details

Document Type: DoD Grant Award
Publication Date: Jun 25, 2021
Source ID: W911NF2110220

Entities

People

Mohit Bansal

Organizations

Army Contracting Command
United States Army
University of North Carolina at Chapel Hill
