On Using Written Language Training Data for Spoken Language Modeling

Abstract

We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training text outside the text distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially. The error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another three years of training text, even without the appropriate preprocessing, substantially improved the language model. We also tested these approaches on spontaneous news dictation and found similar improvements.

First, in Section 3, we explore the effect of increasing the vocabulary size on recognition accuracy in an unlimited vocabulary task. Second, in Section 4, we consider ways to model the differences between the language model training text and the way people actually speak. And third, in Section 5, we show that simply increasing the amount of language model training text helps significantly.

2. THE WSJ CORPUS

The November 1993 ARPA Continuous Speech Recognition (CSR) evaluation was based on speech and language taken from the Wall Street Journal (WSJ). The standard language model training text consisted of about 35 million words of text extracted from the WSJ from 1987 to 1989.
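The out-of-vocabulary percentage quoted in the abstract can be illustrated with a minimal sketch. It assumes the lexicon is simply the N most frequent word types in the training text and that the out-of-vocabulary rate is the fraction of test tokens missing from that lexicon. The function names `build_lexicon` and `oov_rate` and the toy sentences are hypothetical, chosen only for illustration; they are not taken from the report.

```python
from collections import Counter


def build_lexicon(training_tokens, size):
    """Return the `size` most frequent word types in the training text.

    Assumption: frequency-ranked selection; the report may have used
    a different selection procedure.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, lexicon):
    """Return the fraction of test tokens that are not in the lexicon."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for word in test_tokens if word not in lexicon)
    return missing / len(test_tokens)


# Toy data standing in for the training and test text (illustrative only).
train = "the market rose and the market fell and the dollar rose".split()
test = "the dollar fell and the yen rose".split()

small = build_lexicon(train, 1)   # only the single most frequent word
large = build_lexicon(train, 6)   # every word type seen in training

print(f"small lexicon OOV rate: {oov_rate(test, small):.3f}")
print(f"large lexicon OOV rate: {oov_rate(test, large):.3f}")
```

With a larger lexicon, fewer test tokens fall outside the vocabulary, which is the effect the abstract reports when moving from 20,000 to 40,000 words. Words never seen in training (here, "yen") remain out of vocabulary at any lexicon size.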

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1994
Accession Number
ADA460657

Entities

People

  • F. Kubala
  • G. Chou
  • G. Zavaliagkos
  • J. Makhoul
  • Luong N. Nguyen
  • Robert E. Schwartz

Organizations

  • BBN Technologies

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Accuracy
  • Automated Speech Recognition
  • Errors
  • Information Operations
  • Language
  • Markov Chains
  • Military Research
  • Neural Networks
  • Numbers
  • Probability
  • Recognition
  • Sequences
  • Standards
  • Test And Evaluation
  • Test Sets
  • Training
  • Vocabulary

Readers

  • Computational Modeling and Simulation
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks