Modular Representations for Vision and Language
Abstract
Cognitive science studies show that humans have rich and sophisticated interactions between vision and language, which exploit commonsense knowledge about the world, including intuitive physics and social knowledge. Human cognitive skills include the ability to generalize easily to novel situations and to deal with a combinatorially large number of possibilities. Our proposal is based on the conjecture that modularity and representation play key roles in achieving these cognitive abilities by enabling abstraction and compositionality. We argue that purely big data approaches, which dominate much current research on vision-language, are inadequate by themselves due to their failure to generalize compositionally and their unexplainable failure modes. We will exploit recent advances in big data, particularly the learning of sophisticated feature vectors.
More specifically, we propose to develop a computational model of vision-language interactions, called Modular Vision-Language Representations (MVLR), which captures these cognitive abilities. The input to this system is visual content (images or videos) and a complex question. The system will jointly process this multi-modal information and output an answer. MVLR is based on the conjecture that modularity, abstraction, and structured representations of the world are critical to capturing these abilities. MVLR will take advantage of the abundance of big data (particularly for modeling commonsense knowledge), but we argue that big data by itself is not sufficient to capture these cognitive abilities without the modular representations that we will develop. We will evaluate MVLR by its ability to answer complex questions about images and image sequences. These questions will include open-set action recognition questions, such as "Is there a man riding a bicycle and holding a heavy black briefcase?"
An important property of MVLR is that it will be explainable and hence will not suffer from the inexplicable failure modes that occur in some data-driven vision-language models (if we understand the failure modes, it will be possible to correct them). MVLR will enable transparent reasoning and generalization to open domains. It is particularly important that MVLR will be robust and able to generalize to out-of-distribution test data. It will enable collaborative dialogues between humans and agents with questions about images, including not only purely visual information (e.g., object identity) but also information that a human could extract using commonsense knowledge about the world, such as intuitive physics and social knowledge. MVLR will be tested on a range of existing vision-language datasets but, since none of these is fully adequate to test all the vision-language phenomena that our model can address, we will also construct a new dataset. When evaluating MVLR we will use challenging tests, such as out-of-distribution testing and adversarial examiners, in addition to standard performance measures. We will compare our model, when possible, with alternative approaches (no existing algorithm can perform all the tasks we propose). It is important to ensure that MVLR performs at least as well as conventional methods on standard benchmarks and performance measures (where conventional methods perform well) and also outperforms them in more challenging situations.
Document Details
- Document Type: DoD Grant Award
- Publication Date: Jul 24, 2023
- Source ID: N000142312641
Entities
People
- Alan Yuille
Organizations
- Johns Hopkins University
- Office of Naval Research
- United States Navy