Semi-supervised Learning of Multimodal Representations

Abstract

The goal of this project was to create better representations of words by improving vector space language models with multimodal and multilingual information. A large-scale multilingual image dataset, called MMID, was assembled; it associates images with words in 98 languages (up to 10K words per language, with 100 images per word). This dataset let us perform a comprehensive analysis of whether visual similarity can be used to identify translations, and of the extent to which this is affected by linguistic factors such as part of speech and concreteness. We studied whether MMID could be used to mitigate the geographical bias in image classification datasets like ImageNet (for example, a wedding is visually distinct in different regions of the world). We also investigated the extent to which geography affects translatability across pairs of languages; factors such as shared language families, shared ethnic groups, or shared religions have a larger impact than geography on visual similarity, and therefore on translatability via images. Finally, we collected a dataset from Wikipedia by aggregating shared images with multilingual captions, giving us full sentences rather than the individual words in MMID.

Document Details

Document Type: Technical Report
Publication Date: Aug 16, 2022
Accession Number: AD1177263

Entities

People

Chris Callison-Burch
Derry Wijaya

Organizations

University of Pennsylvania
