Improving the Capacity of Language Recognition Systems to Handle Rare Languages Using Radio Broadcast Data

Abstract

The project is divided into two phases. The first phase is planned for May 2008 to October 2008, and the second for November 2008 to April 2009. It comprises three work packages (WP).

This project relies on the Voice of America (VOA) data collection performed by LDC over the past several years. The VOA data will need to be supplemented with the available meta-information, especially about the language(s) contained. The next step will consist of cleaning the data and selecting the relevant speech, since the automatically acquired data is known to be quite noisy for the purposes of LRE:

  1. Automatic segmentation into speech, music and noise segments; only speech will be retained. Speech/music segmentation was the topic of a diploma thesis finished at our department [Hovorka2006] and is available for use in this project.
  2. Voice activity detection (VAD), performed by our phoneme recognizer [Schwarz2006] with all phoneme classes mapped to a single "speech" class. This setup has been used successfully in a wide range of applications, such as speaker recognition, language recognition, speech transcription and spoken term detection, and has been evaluated in several NIST evaluations.
  3. Detection of telephone conversations in the data.

In this project, we will mainly investigate the data that is as close as possible to the target domain: conversational telephone speech (CTS). Therefore, we will concentrate on the segments with detected telephone speech (people calling in to the broadcast), as we believe these should correspond best to CTS. Initial work on Thai for NIST LRE 2007 yielded 8 hours of telephone conversations from approximately 400 hours of VOA data downloaded from the VOA Internet archive.
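The three cleaning steps described in the abstract amount to a chain of filters over labelled audio segments, and the Thai figures give a yield of 8 / 400 = 2%. The following is a minimal sketch of that bookkeeping only, not the report's implementation: the `Segment` fields, the label strings and the function names are all hypothetical, and the decisions of the segmenter, the VAD and the telephone detector are assumed to be already available as flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One stretch of broadcast audio with hypothetical upstream decisions."""
    start_s: float
    end_s: float
    label: str          # assumed segmenter output: "speech", "music" or "noise"
    is_voiced: bool     # assumed VAD decision
    is_telephone: bool  # assumed telephone-channel decision

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def select_telephone_speech(segments):
    """Keep segments that pass all three steps: speech, voiced, telephone."""
    return [
        s for s in segments
        if s.label == "speech" and s.is_voiced and s.is_telephone
    ]


def total_hours(segments):
    """Sum of segment durations, converted from seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0


def yield_fraction(selected_hours, source_hours):
    """Fraction of the source audio that survives the selection."""
    if source_hours <= 0:
        raise ValueError("source_hours must be positive")
    return selected_hours / source_hours


# Toy data (invented for illustration): one hour of audio in three segments.
toy = [
    Segment(0.0, 1800.0, "speech", True, False),    # studio speech, dropped
    Segment(1800.0, 2700.0, "music", False, False),  # music, dropped
    Segment(2700.0, 3600.0, "speech", True, True),   # call-in, kept
]
kept = select_telephone_speech(toy)
toy_yield = yield_fraction(total_hours(kept), total_hours(toy))

# Figures quoted in the abstract for Thai: 8 hours out of roughly 400 hours.
thai_yield = yield_fraction(8.0, 400.0)
```

Keeping the three decisions as separate flags, rather than a single combined label, makes it possible to see which step discards most of the audio when the yield is as low as the one reported for Thai.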

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2011
Accession Number
ADA535721

Entities

People

  • Lukas Burget

Organizations

  • Brno University of Technology

Tags

Communities of Interest

  • Energy and Power Technologies
  • Space

DTIC Thesaurus Topics

  • Accuracy
  • Acquisition
  • Algorithms
  • Databases
  • Dimensionality Reduction
  • Factor Analysis
  • Hidden Markov Models
  • Identification
  • Information Science
  • Information Systems
  • Language
  • Neural Networks
  • Probability
  • Recognition
  • Standards
  • Statistics
  • Supervised Machine Learning

Readers

  • Speech Processing/Speech Recognition.
  • Technical Research and Report Writing.

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation