Machine Translation Based Data Augmentation for Cantonese Keyword Spotting (Author's Manuscript)

Abstract

This paper presents a method for improving a language model for a low-resource language by using statistical machine translation from a related language to generate text data in the target language. In this work, the machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and then used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, for which resources are limited. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. This method enables the keyword search system to detect 1.5 times more out-of-vocabulary words and achieves a 1.7% absolute improvement in actual term-weighted value.

Document Details

Document Type: Technical Report
Publication Date: May 19, 2016
Accession Number: AD1038524

Entities

People

Arseniy Gorin
Guangpu Huang
Jean-Luc Gauvain
Lori Lamel
