On Using Written Language Training Data for Spoken Language Modeling

Abstract

We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training outside the text distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially. The error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another three years' worth of training text, even without the appropriate preprocessing, substantially improved the language model. We also tested these approaches on spontaneous news dictation and found similar improvements.

First, in Section 3, we explore the effect of increasing the vocabulary size on recognition accuracy in an unlimited vocabulary task. Second, in Section 4, we consider ways to model the differences between the language model training text and the way people actually speak. And third, in Section 5, we show that simply increasing the amount of language model training helps significantly.

2. THE WSJ CORPUS

The November 1993 ARPA Continuous Speech Recognition (CSR) evaluation was based on speech and language taken from the Wall Street Journal (WSJ). The standard language model was estimated from about 35 million words of text extracted from the WSJ from 1987 to 1989.
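
The out-of-vocabulary percentage mentioned in the abstract can be illustrated with a minimal sketch. It assumes the lexicon is chosen as the most frequent words of the training text, which this excerpt does not state; the function names `build_lexicon` and `oov_rate` and the toy word lists are hypothetical and are not taken from the report.

```python
from collections import Counter


def build_lexicon(training_tokens, size):
    """Return the `size` most frequent word types in the training tokens.

    Assumption: the lexicon is selected by training-text frequency.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, lexicon):
    """Fraction of test tokens that are not covered by the lexicon."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in lexicon)
    return missing / len(test_tokens)


# Toy data for illustration only (not WSJ text).
training = "the the the the market market market rose rose and".split()
test = "the bank fell and the fund rose".split()

small = build_lexicon(training, 2)  # stands in for the smaller lexicon
large = build_lexicon(training, 4)  # stands in for the larger lexicon

print(f"small lexicon OOV rate: {oov_rate(test, small):.3f}")
print(f"large lexicon OOV rate: {oov_rate(test, large):.3f}")
```

On this toy data the larger lexicon covers more of the test tokens, which mirrors the direction of the effect the abstract reports when moving from 20,000 to 40,000 words; the magnitudes here carry no meaning.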

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 1994
Accession Number: ADA460657

Entities

People

F. Kubala
G. Chou
G. Zavaliagkos
J. Makhoul
Luong N. Nguyen
Robert E. Schwartz

Organizations

BBN Technologies
