Using Web Text to Improve Keyword Spotting in Speech

Abstract

For low-resource languages, collecting sufficient training data to build acoustic and language models is time-consuming and often expensive. However, large amounts of text data, such as online newspapers, web forums, or online encyclopedias, usually exist for languages that have a large population of native speakers. This text data can be easily collected from the web and then used both to expand the recognizer's vocabulary and to improve the language model. One challenge is normalizing and filtering the web data for a specific task. In this paper, we investigate the use of online text resources to improve the performance of speech recognition, specifically for the task of keyword spotting. For the five languages provided in the base period of the IARPA BABEL project, we automatically collected text data from the web using only LimitedLP resources. We then compared two methods for filtering the web data, one based on perplexity ranking and the other based on out-of-vocabulary (OOV) word detection. By integrating the web text into our systems, we observed significant improvements in keyword spotting accuracy for four out of the five languages. The best approach obtained an improvement in actual term weighted value (ATWV) of 0.0424 compared to a baseline system trained only on LimitedLP resources. On average, ATWV was improved by 0.0243 across the five languages.
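The abstract names two filtering strategies but does not say how either is implemented. The following is a minimal sketch of what such filters might look like, not the authors' method. It assumes whitespace-tokenized sentences, swaps in an add-one smoothed unigram model for whatever language model the report actually used, and treats the OOV-based filter as "keep web sentences that contain a target word missing from the current vocabulary". All function names (`train_unigram`, `rank_by_perplexity`, `filter_by_oov`) are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one smoothed unigram model from in-domain sentences.

    Assumption: a unigram model stands in for the real language model.
    Returns the word counts and a function mapping a word to its log-probability.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words

    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab_size))

    return counts, logprob


def perplexity(sentence, logprob):
    """Per-word perplexity of one sentence under the unigram model."""
    words = sentence.split()
    if not words:
        return float("inf")  # empty lines sort last
    return math.exp(-sum(logprob(w) for w in words) / len(words))


def rank_by_perplexity(web_sentences, logprob, keep):
    """Keep the `keep` web sentences that look most like the in-domain data."""
    ranked = sorted(web_sentences, key=lambda s: perplexity(s, logprob))
    return ranked[:keep]


def filter_by_oov(web_sentences, vocab, targets):
    """Keep web sentences containing a target word absent from `vocab`.

    Assumption: `targets` is a list of words of interest (for example,
    hypothesized OOV words); how the report obtains them is not stated.
    """
    wanted = set(targets) - set(vocab)
    return [s for s in web_sentences if wanted & set(s.split())]


if __name__ == "__main__":
    in_domain = ["the cat sat", "the dog sat"]
    web = ["the cat sat", "zebra quantum flux", "the dog ran"]
    counts, logprob = train_unigram(in_domain)
    print(rank_by_perplexity(web, logprob, keep=2))
    print(filter_by_oov(web, counts, targets=["ran", "the"]))
```

In this toy setup, perplexity ranking favors web sentences that resemble the in-domain transcripts, while the OOV filter favors sentences that bring new words into the vocabulary. The two filters therefore select different parts of the web data, which is one plausible reason to compare them.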

Document Details

Document Type
Technical Report
Publication Date
Dec 08, 2013
Accession Number
AD1173616

Entities

People

  • Alexander I. Rudnicky
  • Ankur Gandhe
  • Florian Metze
  • Ian Lane
  • Long Qin
  • Matthias Eck

Organizations

  • Carnegie Mellon University

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Automated Speech Recognition
  • Computer Languages
  • Conversion
  • Decoding
  • Department Of Defense
  • Detection
  • Encyclopedias
  • Errors
  • False Alarms
  • Filters
  • Filtration
  • Hybrid Systems
  • Internet
  • Language
  • Measurement
  • Natural Language Processing
  • Newspapers
  • Standards
  • Vocabulary

Fields of Study

  • Computer science
  • Education

Readers

  • Computational Linguistics
  • Library and Information Science
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation