Referential Grounding in Multimodal Machine Translation
Abstract
This project aimed to advance the state of the art in multimodal machine translation (MMT), in which source-language text is supplemented by visual information (images or video) that serves as additional context for understanding the text and translating it into a target language. The core advances concern referential grounding, i.e., guiding the alignment between image regions and source (and/or target) words so that the visual context is more useful for translation.

Work done during the project covered the following directions:

1. Improving supervised attention mechanisms that map source or target words to image regions, addressing attention both at encoding time (i.e., learning alignments between source words and objects in the image) and at decoding time (i.e., learning alignments between target words and objects in the image); improving the underlying multimodal neural machine translation architectures and fusion strategies so that they can use such information; and exploring more recent and more effective types of visual features.
2. Leveraging information from multiple vision-and-language tasks and datasets to improve multilingual grounding.
3. Creating resources to facilitate work on referential grounding.
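
To make the idea of supervised attention in direction 1 concrete, the following is a minimal sketch and not the project's actual implementation. It assumes that a model produces one unnormalized score per image region for a given word, and that gold word-to-region alignments are available as region indices. The function names `softmax` and `supervised_attention_loss`, the uniform target over gold regions, and the toy scores are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Turn unnormalized region scores into an attention distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_attention_loss(scores, gold_regions):
    """Cross-entropy between the attention over image regions and a
    uniform target over the gold-aligned regions (an assumed target;
    other target distributions are possible).

    scores:       one unnormalized score per image region, for one word
    gold_regions: indices of the regions the word is annotated to refer to
    """
    if not gold_regions:
        # Words with no annotated region contribute no supervision.
        return 0.0
    attention = softmax(scores)
    target = 1.0 / len(gold_regions)
    return -sum(target * math.log(attention[r]) for r in gold_regions)


# Toy example: three image regions, and a word whose attention favors region 0.
scores = [2.0, 0.5, -1.0]
loss_aligned = supervised_attention_loss(scores, [0])     # gold matches the attention peak
loss_misaligned = supervised_attention_loss(scores, [2])  # gold is the least-attended region
```

In a setup like this, the auxiliary loss would typically be added to the translation loss with a weighting factor, so that attention is pushed toward annotated regions while the model is still trained to translate. Whether the supervision is applied to source words (encoding time) or target words (decoding time) only changes which words the scores are computed for.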
Document Details
- Document Type: Technical Report
- Publication Date: Dec 22, 2022
- Accession Number: AD1194121
Entities
People
- Lucia Specia
Organizations
- Imperial College London