Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition

Abstract

Automatic recognition of Arabic dialectal speech is a challenging task because Arabic dialects are essentially spoken varieties. Only a few dialectal resources are available to date; moreover, most available acoustic data collections are transcribed without diacritics. Such a transcription omits essential pronunciation information about a word, such as short vowels. In this paper we investigate various procedures that enable us to use such training data by automatically inserting the missing diacritics into the transcription. These procedures use acoustic information in combination with different levels of morphological and contextual constraints. We evaluate their performance against manually diacritized transcriptions. In addition, we demonstrate the effect of their accuracy on the recognition performance of acoustic models trained on automatically diacritized training data.
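
To make the abstract's two central ideas concrete, the following minimal Python sketch shows (a) how removing diacritics collapses distinct pronunciations into one written form and (b) one possible way to score automatically diacritized text against a manual reference. The function names, the choice of the Unicode range U+064B to U+0652 as "diacritics", and the word-level error rate are all assumptions made for illustration; the abstract does not specify the report's actual scoring method.

```python
# Assumption: "diacritics" here means the Arabic marks U+064B..U+0652
# (tanwin forms, fatha, damma, kasra, shadda, sukun).
ARABIC_DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return text with the Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def diacritization_error_rate(reference, hypothesis):
    """Fraction of words whose diacritized form differs from the reference.

    Hypothetical metric for illustration: both strings must contain the
    same words once diacritics are stripped, so only the inserted marks
    are being compared.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis differ in word count")
    for ref, hyp in zip(ref_words, hyp_words):
        if strip_diacritics(ref) != strip_diacritics(hyp):
            raise ValueError("underlying undiacritized words differ")
    errors = sum(ref != hyp for ref, hyp in zip(ref_words, hyp_words))
    return errors / len(ref_words)


# Two diacritizations of the same consonant skeleton k-t-b:
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "kutiba"
SKELETON = "\u0643\u062A\u0628"                  # undiacritized form

# Both words are written identically once the diacritics are dropped.
same_skeleton = strip_diacritics(KATABA) == strip_diacritics(KUTIBA) == SKELETON

# One of two words carries the wrong diacritics.
rate = diacritization_error_rate(KATABA + " " + KATABA, KATABA + " " + KUTIBA)
```

In this sketch a word counts as wrong if any of its marks differs from the reference; a character-level variant would give partial credit for words that are only partly mis-diacritized.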

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2004
Accession Number
ADA457846

Entities

People

  • Dimitra Vergyri
  • Katrin Kirchhoff

Organizations

  • University of Washington

Tags

DTIC Thesaurus Topics

  • Acoustic Signals
  • Automated Speech Recognition
  • Automatic
  • Computational Science
  • Consonants
  • Errors
  • Hidden Markov Models
  • Language
  • Linguistics
  • Markov Models
  • Models
  • Probability
  • Probability Distributions
  • Random Variables
  • Recognition
  • Speech
  • Standards

Readers

  • Computational Linguistics
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation