Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation

Abstract

Recent work has shown large performance gains using denoising DNNs for speech processing tasks under challenging acoustic conditions. However, training these DNNs requires large amounts of parallel multichannel speech data, which can be impractical or expensive to collect. The effective use of synthetic parallel data as an alternative has been demonstrated for several speech technologies, including automatic speech recognition and speaker recognition (SR). This paper demonstrates that denoising DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same performance gains are achieved using synthetic data generated with a limited number of room impulse responses (RIRs) and noise sources derived from Mixer 2. Using RIRs from three publicly available sources used in the Kaldi ASpIRE recipe yields somewhat lower pooled gains of 34% EER and 25% min DCF. These results confirm the effective use of synthetic parallel data for DNN channel compensation even when the RIRs used for synthesizing the data are not particularly well-matched to the task.
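The abstract does not spell out how the synthetic parallel data is produced. A common approach, shown here only as a minimal sketch and not as the report's actual pipeline, is to convolve clean speech with an RIR and add a noise source scaled to a target signal-to-noise ratio. The function name `synthesize_channel`, the `snr_db` parameter, and the random stand-in signals are all hypothetical and chosen for illustration.

```python
import numpy as np


def synthesize_channel(clean, rir, noise, snr_db):
    """Return a reverberant, noisy copy of `clean` with the same length.

    Illustrative assumption: one RIR and one additive noise source per
    utterance, mixed at a fixed SNR measured against the reverberant speech.
    """
    # Reverberate, then truncate so the output stays sample-aligned
    # ("parallel") with the clean input.
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Repeat or trim the noise recording to cover the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise


# Stand-in signals (random data, not real speech or measured RIRs).
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
rir = rng.standard_normal(200) * np.exp(-np.arange(200) / 40.0)
noise = rng.standard_normal(1500)

noisy = synthesize_channel(clean, rir, noise, snr_db=10.0)
```

Each pair `(noisy, clean)` would then serve as one input/target example for a denoising DNN. In practice the RIRs and noise recordings would be drawn from a pool so that a single clean utterance yields many synthetic channels.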

Document Details

Document Type: Technical Report
Publication Date: Sep 08, 2016
Accession Number: AD1033607

Entities

People

Douglas A. Reynolds
Frederick S. Richardson
Jennifer T. Melot
Michael S. Brandstein

Organizations

Massachusetts Institute of Technology
