Development and Evaluation of Audio-Visual ASR: A Study on Connected Digit Recognition

Abstract

We present our findings from audio-visual speech recognition experiments for connected digit recognition in noisy environments. We derive hybrid (geometric- and appearance-based) visual lip features using a real-time lip tracking algorithm that we proposed previously. Using a small single-speaker corpus modeled after the TIDIGITS database, we build whole-word HMMs using both single-stream and 2-stream modeling strategies. For the 2-stream HMM method, we use stream-dependent weights to adjust the relative contributions of the two feature streams based on the acoustic SNR level. The 2-stream HMM method consistently gave the lowest WER, with an error reduction of 83% at an SNR of -3 dB compared to the acoustic-only baseline. A visual-only ASR WER of 6.85% was also achieved. A real-time system prototype was developed for concept demonstration.
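The stream-weighting idea in the abstract can be illustrated with a minimal sketch. It assumes the common multi-stream formulation in which the per-state observation log-likelihoods of the audio and visual streams are combined as a weighted sum, with weights that sum to one; the abstract does not state the exact formulation used in the report. The function names, the SNR range, the weight values, and the linear SNR-to-weight mapping are all hypothetical and chosen only for illustration.

```python
def audio_weight_for_snr(snr_db, low_db=-3.0, high_db=30.0,
                         w_low=0.3, w_high=0.9):
    """Map an acoustic SNR (dB) to an audio stream weight.

    Hypothetical mapping: linear interpolation between two assumed
    operating points, clamped outside [low_db, high_db]. The report
    may have chosen its weights differently (e.g. tuned per SNR level).
    """
    if snr_db <= low_db:
        return w_low
    if snr_db >= high_db:
        return w_high
    frac = (snr_db - low_db) / (high_db - low_db)
    return w_low + frac * (w_high - w_low)


def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Combine per-stream state log-likelihoods with stream weights.

    Assumes the weights sum to one, so the visual weight is
    1 - audio_weight. In the probability domain this corresponds to
    b_audio ** w_a * b_visual ** w_v.
    """
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the report.
    for snr in (-3.0, 10.0, 30.0):
        w_a = audio_weight_for_snr(snr)
        score = combined_log_likelihood(-10.0, -4.0, w_a)
        print(f"SNR {snr:5.1f} dB  audio weight {w_a:.2f}  score {score:.2f}")
```

With a mapping of this kind, the visual stream contributes more to the state score as the acoustic SNR drops, which is the behavior the abstract describes for the 2-stream method.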

Document Details

Document Type
Technical Report
Publication Date
Jun 12, 2002
Accession Number
ADP014020

Entities

People

  • Michael T. Chan

Tags

Communities of Interest

  • Air Platforms

DTIC Thesaurus Topics

  • Accuracy
  • Active Shape Models
  • Algorithms
  • Automated Speech Recognition
  • Computer Vision
  • Databases
  • Errors
  • Feature Extraction
  • Hidden Markov Models
  • Markov Models
  • Models
  • Multimedia
  • Observation
  • Recognition
  • Signal Processing
  • Test And Evaluation
  • Workshops

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML