Versatile Visual Description with Knowledge-Enriched Inductive Biases and Multi-Source Lifelong Learning

Abstract

Research problem: Understanding the visual content of ever-growing collections of images and videos and describing it in accurate, coherent natural language is widely useful in practice and is a fundamental challenge in artificial intelligence. Despite the impressive recent advances in machine learning, computer vision, and natural language processing, most of the current image description literature focuses on describing a single type of image, particularly regular natural images, by generating a single short sentence (a.k.a. a caption) for each input image. Moreover, training captioning models has often relied heavily on large amounts of directly supervised data, even though collecting labeled data can be expensive and hard to scale. The lack of rich structured prior knowledge and the inability to learn continuously in the presence of dynamic data and changing environments also lead to sub-optimal performance, yielding bland, irrelevant, or even inaccurate descriptions.

Technical approaches: To address these fundamental difficulties in real-world visual description applications, we propose a research project that aims to develop a novel, comprehensive framework which can: 1) produce full image and multi-image descriptions of complex content, 2) learn from abundant weak forms of supervision without costly manual annotation, 3) incorporate rich structured prior knowledge, 4) improve continuously with online feedback and growing information, and 5) support fast development and extension of further functionality. We will develop advanced, general-purpose modeling architectures and learning paradigms, enabling efficient learning of visual description models that encode large-scale domain knowledge with effective inductive biases, leverage rich weak supervision with flexible learning algorithms, continuously improve from user feedback, and adapt to new domains for fast, scalable deployment. The library of modeling and learning functionality will be integrated into a comprehensive toolkit, designed to be modular and extensible in support of fast development and deployment of diverse real applications.

Anticipated outcome: If successful, our proposed techniques are expected to enable the desired applications with minimal human labor such as exhaustive annotation, feature engineering, and manual model upgrading. Our approach will thus alleviate the crucial methodological and engineering bottlenecks of development and deployment, enabling versatile, performant, and efficient visual description systems that fulfill the diverse needs arising in practice.

Impact on NGA’s capabilities: There is a pressing need for automatic, timely, and accurate analysis of geospatial imagery and related information in support of anomaly detection, vulnerability prediction, and national security in general. Our work will enable more efficient analysis and description of geospatial and generic images by automatically recognizing and characterizing objects, relations, activities, and other features of interest, and by summarizing them in timely text reports for downstream applications.

The proposed work will be undertaken by the SAILING Lab at the School of Computer Science, Carnegie Mellon University, led by PI Professor Eric Xing. Dr. Xing is a Fellow of the AAAI and IEEE, and is one of the leading experts in Machine Learning.

Document Details

Document Type: DoD Grant Award
Publication Date: Oct 06, 2020
Source ID: HM04762010002

Entities

People

Eric P. Xing

Organizations

Carnegie Mellon University
National Geospatial-Intelligence Agency
