Low-Resource Speech Translation of Urdu to English Using Semi-Supervised Part-of-Speech Tagging and Transliteration

Abstract

This paper describes the construction of ASR and MT systems for translation of speech from Urdu into English. As both Urdu pronunciation lexicons and Urdu-English bitexts are sparse, we employ several techniques that make use of semi-supervised annotation to improve ASR and MT training. Specifically, we describe 1) the construction of a semi-supervised HMM-based part-of-speech tagger that is used to train factored translation models and 2) the use of an I-HMM-based transliterator from which we derive a spelling-to-pronunciation model for Urdu used in ASR training. We describe experiments performed for both ASR and MT training in the context of the Urdu-to-English task of the NIST MT08 Evaluation and we compare methods making use of additional annotation with standard statistical MT and ASR baselines.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2008
Accession Number
ADA519247

Entities

People

  • A. R. Aminzadeh
  • Wade Shen

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Air Force
  • Air Force Research Laboratories
  • Automated Speech Recognition
  • Consonants
  • Data Sets
  • Hidden Markov Models
  • Iterations
  • Language
  • Machine Translation
  • Markov Models
  • Models
  • Probability
  • Standards
  • Test And Evaluation
  • Test Sets
  • Translations

Readers

  • Computational Linguistics
  • Neural Network Machine Learning.
  • Speech Processing/Speech Recognition.

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks