Binary and Ratio Time-frequency Masks for Robust Speech Recognition
Abstract
A time-varying Wiener filter extracts a speech signal from a mixture using the a priori signal-to-noise ratio in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with that of a missing-data recognizer that operates in the spectral domain using the time-frequency units dominated by speech. For use by the missing-data recognizer, the same processor is used to estimate an ideal time-frequency binary mask, which selects the speech if it is stronger than the interference in a local time-frequency unit. We find that the missing-data recognizer performs better on a small-vocabulary recognition task, but the conventional recognizer performs substantially better when the vocabulary size is larger.
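The two masks described in the abstract can be illustrated with a minimal sketch. It assumes the speech and interference energies in each time-frequency unit are already known, whereas the report estimates them with a binaural processor. The function names and the nested-list layout (rows as time frames, columns as frequency channels) are illustrative assumptions, not taken from the report. The ratio mask uses the standard Wiener gain SNR / (1 + SNR), which equals speech energy divided by total energy.

```python
def ratio_mask(speech_energy, noise_energy):
    """Wiener-style gain SNR / (1 + SNR) per time-frequency unit.

    Algebraically this is speech / (speech + noise). Units with no
    energy at all get a gain of 0.0 (an assumption of this sketch).
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append(s / total if total > 0 else 0.0)
        mask.append(row)
    return mask


def binary_mask(speech_energy, noise_energy):
    """1 where speech is stronger than the interference, else 0."""
    return [
        [1 if s > n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


# Hypothetical energies: 2 time frames x 2 frequency channels.
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]

print(ratio_mask(speech, noise))
print(binary_mask(speech, noise))
```

In this sketch the ratio mask keeps a graded weight for every unit, which suits a recognizer that needs a full resynthesized signal, while the binary mask only labels units as reliable or unreliable, which is the input a missing-data recognizer expects.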
Document Details
- Document Type: Technical Report
- Publication Date: May 01, 2004
- Accession Number: AD1001184
Entities
People
- DeLiang Wang
- Nicoleta Roman
- Soundararajan Srinivasan
Organizations
- Ohio State University