Semi-supervised Learning of Multimodal Representations
Abstract
The goal of this project was to create better representations of words by improving vector space language models with multimodal and multilingual information. A large-scale dataset of multilingual images, called MMID, was assembled; it associates images with words in 98 different languages (up to 10K words for each language, with 100 images per word). This dataset let us perform a comprehensive analysis of whether visual similarity could be used to identify translations, and of the extent to which this is affected by linguistic factors such as part of speech and concreteness. We studied whether MMID could be used to mitigate the geographical bias in image classification datasets like ImageNet (for example, a wedding is visually distinct in different regions of the world). The extent to which geography affects translatability across pairs of languages was also investigated; factors such as shared language families, ethnic groups, or religions have a larger impact than geography on visual similarity, and therefore on translatability via images. We also collected a dataset from Wikipedia by aggregating shared images with multilingual captions, giving us full sentences rather than the individual words in MMID.
Document Details
- Document Type: Technical Report
- Publication Date: Aug 16, 2022
- Accession Number: AD1177263
Entities
People
- Chris Callison-Burch
- Derry Wijaya
Organizations
- University of Pennsylvania