The Effect of Training Data Set Composition on the Performance of a Neural Image Caption Generator

Abstract

This research seeks to determine how many images of a particular object in a training data set are necessary to achieve caption quality saturation in neural image caption generators. Understanding the relationship between caption quality and the size and composition of training data sets could improve efficiency in model training and lead to the development of optimized data sets for different tasks. We hypothesize that increasing the exposure of a neural network to an object will improve its performance up to a point, after which the caption quality will saturate, and that this saturation point may vary with the object's visual homogeneity. We trained several image captioning models, using the existing NeuralTalk2 code, on subsets of the Microsoft Common Objects in Context data set that each contained a precise number of images of some common object categories (e.g., cat and pizza). The performance at different levels of exposure to the selected objects was compared using the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and Consensus-Based Image Description Evaluation (CIDEr) automated scoring metrics. The data indicate that increasing the quantity of images of a particular object in the training data set improved performance up to 1,500 images, but not beyond that.

Document Details

Document Type: Technical Report
Publication Date: Sep 01, 2017
Accession Number: AD1039145

Entities

People

Abigail Wilson
Adrienne Raglin

Organizations

United States Army Research Laboratory