A Comparison of Signal-Processing Front Ends for Automatic Speech Recognition

Abstract

The first stage of any system for automatic speech recognition (ASR) is a signal-processing front end that converts a sampled speech waveform into a representation more suitable for later processing. Several front ends are compared, three of which are based on knowledge about the human auditory system. The performance of an ASR system with these front ends was compared to that of a control mel filter bank (MFB)-based cepstral representation, both in clean speech and with speech degraded by noise and spectral variability. Using the TI-105 isolated word database, it was found that auditory front ends performed comparably to MFB cepstra, and sometimes slightly better in noise. With MFB cepstral recognition error rates ranging from 0.5% to 26.9%, depending on signal-to-noise ratio (SNR), auditory models performed up to four percentage points better. With speech degraded by linear filtering, where MFB cepstra showed error rates ranging from 0.5% to 3.1%, auditory outputs improved performance by as much as 0.4% for conditions with high baseline error rates. This performance gain comes at a significant computational expense: approximately one-third real time for MFB cepstra, compared with as much as 100 times real time or more for auditory models. These results disagree with previous studies that suggest considerably more improvement with auditory models. However, these earlier studies used a linear predictive coding (LPC)-based control front end, which is shown to perform significantly worse than MFB cepstra under noisy conditions (e.g., 2.7% error rate with mel-cepstra vs. 25.3% with LPC at 18-dB SNR). Data-reduction techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) were also evaluated. PCA provided no gain in noise and a slight gain with spectral variability.
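The noise conditions in the abstract are specified by SNR (for example, 18 dB). As a rough illustration of what such a condition means, the sketch below scales a noise signal so that a speech-plus-noise mixture reaches a requested SNR. It is not taken from the report: the function names `mean_power`, `mix_at_snr`, and `measured_snr_db` are hypothetical, the SNR is assumed to be the ratio of average powers over the whole utterance, and the noise type and measurement procedure actually used in the report may differ.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal: the mean of its squared samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR.

    Assumption (not from the report): SNR = 10 * log10(P_speech / P_noise),
    with both powers averaged over the whole utterance.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have nonzero power")
    # Noise power required for the target SNR, and the amplitude gain
    # that brings the given noise to that power.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in signals: a sine tone as "speech" and Gaussian noise.
    speech = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(8000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
    mixture = mix_at_snr(speech, noise, 18.0)
    print(round(measured_snr_db(speech, mixture), 3))
```

A degraded test set for a given condition could then be built by applying `mix_at_snr` to every utterance before passing it to the front end under test.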

Document Details

Document Type
Technical Report
Publication Date
Jul 18, 1994
Accession Number
ADA284962

Entities

People

  • C. R. Jankowski Jr.
  • H-d. H. Vo
  • R. P. Lippmann

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Automated Speech Recognition
  • Automatic
  • Computer Programming
  • Data Reduction
  • Data Science
  • Databases
  • Discriminant Analysis
  • Ear
  • Filters
  • Filtration
  • Frequency
  • Hidden Markov Models
  • Information Science
  • Linear Filtering
  • Mathematical Filters
  • Recognition
  • Signal Processing

Fields of Study

  • Engineering

Readers

  • Mathematics or Statistics
  • Neural Network Machine Learning
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML