Speech Segregation Based on Sound Localization
Abstract
At a cocktail party, we can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel machine learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). The auditory masking effect motivates the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. We observe that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic deviations for ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation shows that the resulting system produces masks very close to ideal binary ones, and gives a significant improvement in performance over an existing approach, as quantified by changes in signal-to-noise ratio before and after segregation.
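As a rough illustration of the ideal binary mask and the SNR comparison mentioned in the abstract, the sketch below builds a mask from known target and interference energies in each T-F unit and compares an energy-based SNR before and after masking. The function names, the toy energy values, and the energy-domain SNR are assumptions made for illustration only; they are not taken from the report, which estimates the mask from ITD and IID cues rather than from known source energies.

```python
import math

def ideal_binary_mask(target_energy, interference_energy):
    """Return 1 for each T-F unit where the target is stronger than the interference, else 0."""
    return [[1 if t > i else 0 for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target_energy, interference_energy)]

def masked_total(energy, mask):
    """Sum the energy over the T-F units that the mask retains."""
    return sum(e * m
               for e_row, m_row in zip(energy, mask)
               for e, m in zip(e_row, m_row))

def snr_db(signal_total, noise_total):
    """Signal-to-noise ratio in decibels from total energies."""
    return 10.0 * math.log10(signal_total / noise_total)

# Hypothetical energies: 2 frequency channels x 3 time frames.
target = [[4.0, 1.0, 9.0],
          [0.5, 2.0, 0.1]]
interference = [[1.0, 3.0, 2.0],
                [2.0, 1.0, 0.4]]

mask = ideal_binary_mask(target, interference)

# An all-ones mask keeps every unit, which gives the SNR of the unprocessed mixture.
all_ones = [[1] * len(row) for row in target]
snr_before = snr_db(masked_total(target, all_ones), masked_total(interference, all_ones))
snr_after = snr_db(masked_total(target, mask), masked_total(interference, mask))

print(mask)
print(f"SNR before: {snr_before:.2f} dB, after: {snr_after:.2f} dB")
```

In this toy case the mask keeps only the units dominated by the target, so the retained target energy outweighs the retained interference by a larger margin than in the unmasked mixture.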
Document Details
- Document Type: Technical Report
- Publication Date: Jan 01, 2002
- Accession Number: AD1001139
Entities
People
- DeLiang Wang
- Guy J. Brown
- Nicoleta Roman
Organizations
- Ohio State University