Machine Translation Based Data Augmentation for Cantonese Keyword Spotting (Author's Manuscript)

Abstract

This paper presents a method for improving the language model of a low-resource language by using statistical machine translation from a related language to generate text in the target language. The machine translation model is trained on a corpus of parallel Mandarin-Cantonese subtitles and used to translate a large set of Mandarin conversational telephone transcripts into Cantonese, for which resources are limited. The translated transcripts are used to train a more robust language model for speech recognition and keyword search in Cantonese conversational telephone speech. With this method, the keyword search system detects 1.5 times more out-of-vocabulary words and achieves a 1.7% absolute improvement in actual term-weighted value.
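
For readers unfamiliar with the metric cited above, the following is a minimal sketch of how a term-weighted value is commonly computed in keyword search evaluations. It is not taken from the report: the function name, the input layout, and the default weight beta = 999.9 (the value conventionally used in NIST-style evaluations) are assumptions made for illustration. The "actual" term-weighted value is this score measured at the decision threshold the system itself chose.

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """Sketch of the term-weighted value (TWV) for a keyword search system.

    counts: dict mapping keyword -> (n_correct, n_false_alarm, n_true),
            a hypothetical layout chosen for this example.
    speech_seconds: total duration of the evaluated speech, in seconds.
    beta: false-alarm weight; 999.9 is assumed here, not taken from the report.
    """
    p_miss_sum = 0.0
    p_fa_sum = 0.0
    n_keywords = 0
    for n_correct, n_false_alarm, n_true in counts.values():
        if n_true == 0:
            continue  # keywords absent from the reference are not scored
        # Miss probability: fraction of true occurrences not detected.
        p_miss_sum += 1.0 - n_correct / n_true
        # False-alarm probability: false alarms per non-target trial,
        # with one trial assumed per second of speech.
        p_fa_sum += n_false_alarm / (speech_seconds - n_true)
        n_keywords += 1
    if n_keywords == 0:
        raise ValueError("no keyword has reference occurrences")
    # Average both error rates over keywords, then combine.
    return 1.0 - (p_miss_sum + beta * p_fa_sum) / n_keywords
```

Under these assumptions a perfect system scores 1.0 and a system that returns nothing scores 0.0, so the 1.7% absolute improvement reported above corresponds to a gain of 0.017 on this scale.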

Document Details

Document Type
Technical Report
Publication Date
May 19, 2016
Accession Number
AD1038524

Entities

People

  • Arseniy Gorin
  • Guangpu Huang
  • Jean-Luc Gauvain
  • Lori Lamel

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automated Speech Recognition
  • Computational Science
  • Data Science
  • Demographic Cohorts
  • Detection
  • Dictionaries
  • Language
  • Machine Translation
  • Neural Networks
  • Personality
  • Recognition
  • Recurrent Neural Networks
  • Standards
  • Training
  • Translations
  • Vocabulary

Fields of Study

  • Education
  • Engineering

Readers

  • Computational Linguistics
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks