Speech Segregation Based on Sound Localization

Abstract

At a cocktail party, we can selectively attend to a single voice and filter out all other acoustic interference. Simulating this perceptual ability remains a great challenge. This paper describes a novel machine learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). The auditory masking effect motivates the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. We observe that, within a narrow frequency band, changes in the relative strength of the target source with respect to the interference produce systematic deviations in ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification to estimate ideal binary masks. A systematic evaluation shows that the resulting system produces masks very close to the ideal binary ones and gives a significant improvement in performance over an existing approach, as quantified by changes in signal-to-noise ratio before and after segregation.
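The ideal binary mask described in the abstract can be illustrated with a small sketch. It assumes the target and interference energies in each T-F unit are known separately, which is what makes the mask "ideal"; the report's system instead estimates the mask from ITD and IID. The function names, the toy energy values, and the energy-ratio SNR below are illustrative assumptions, not the report's implementation or its exact evaluation metric.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """Return 1 where the target is stronger than the interference, else 0."""
    target_energy = np.asarray(target_energy, dtype=float)
    interference_energy = np.asarray(interference_energy, dtype=float)
    return (target_energy > interference_energy).astype(int)

def snr_db(signal_energy, noise_energy):
    """Simplified SNR in dB: ratio of total signal energy to total noise energy."""
    return 10.0 * np.log10(np.sum(signal_energy) / np.sum(noise_energy))

# Toy energies (made-up values): rows are frequency channels, columns are time frames.
target = np.array([[4.0, 1.0, 9.0],
                   [0.5, 6.0, 2.0]])
interference = np.array([[1.0, 3.0, 2.0],
                         [2.0, 1.0, 2.0]])

mask = ideal_binary_mask(target, interference)

# SNR before segregation uses all T-F units; after, only the units the mask keeps.
snr_before = snr_db(target, interference)
snr_after = snr_db(target * mask, interference * mask)

print(mask)
print(snr_after > snr_before)
```

In this toy case the mask keeps only the units where the target dominates, so the retained mixture has a higher energy-based SNR than the unprocessed one. A unit with equal target and interference energy is rejected, following the "stronger than" wording of the abstract.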

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2002
Accession Number
AD1001139

Entities

People

  • DeLiang Wang
  • Guy J. Brown
  • Nicoleta Roman

Organizations

  • Ohio State University

Tags

Fields of Study

  • Computer science

Readers

  • Speech Processing/Speech Recognition.
  • Statistical inference.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • Space
  • Space - Space Objects