From Word-Spotting to OOV Modeling
Abstract
This paper explores one dimension along which word spotting and speech recognition differ: the nature of the background model. In word spotting, a relatively small number of keywords float on a sea of unknown words. In speech recognition, an occasional unknown word punctuates utterances that are otherwise completely within the vocabulary. Despite this difference in viewpoint, in some circumstances implementations of the two may become very similar. When transcribed data is available for a domain, word spotting benefits from the more detailed background model that such data can support. The manner in which the background is modeled in these cases is reminiscent of speech recognition. For example, a large vocabulary with good coverage may be extracted from the corpus, so that relatively few words in an utterance remain unmodeled. In this case, the situation is qualitatively similar to out-of-vocabulary (OOV) modeling in a conventional speech recognizer, except that the vocabulary is strictly divided into "filler" and "keyword." This paper describes a mechanism for bootstrapping from a relatively weak background model for word spotting, where OOV words dominate, to a much stronger model where many more word or phrase clusters have been moved to the foreground and explicitly modeled. With this increase in vocabulary comes an increase in the potency of language modeling, boosting performance on the original vocabulary. This paper shows how a conventional speech recognizer can be convinced to cluster frequently occurring acoustic patterns without requiring transcribed data.
Document Details
- Document Type: Technical Report
- Publication Date: Jan 01, 2001
- Accession Number: ADA434772
Entities
People
- Paul Fitzpatrick
Organizations
- Massachusetts Institute of Technology