Development and Evaluation of Audio-Visual ASR: A Study on Connected Digit Recognition
Abstract
We present our findings from audio-visual speech recognition experiments on connected digit recognition in noisy environments. We derive hybrid (geometric- and appearance-based) visual lip features using a real-time lip tracking algorithm that we proposed previously. Using a small single-speaker corpus modeled after the TIDIGITS database, we build whole-word HMMs with both single-stream and 2-stream modeling strategies. For the 2-stream HMM method, we use stream-dependent weights to adjust the relative contributions of the two feature streams according to the acoustic SNR level. The 2-stream HMM consistently gave the lowest WER, with an error reduction of 83% at the -3 dB SNR level compared to the acoustic-only baseline. Visual-only ASR also achieved a WER of 6.85%. A real-time system prototype was developed for concept demonstration.
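The stream weighting described in the abstract can be illustrated with a minimal sketch. It assumes the common multi-stream formulation in which a state's combined log-likelihood is a weighted sum of the per-stream log-likelihoods, with the two weights summing to 1. The abstract does not specify this constraint or how the weights are derived from SNR, so the linear mapping, its dB endpoints, and all function names below are hypothetical and are not the report's actual method.

```python
def acoustic_weight(snr_db, lo_db=-3.0, hi_db=20.0):
    """Hypothetical SNR-to-weight map (an assumption, not from the report).

    Linearly interpolates the acoustic stream weight between 0 at lo_db
    and 1 at hi_db, clamping values outside that range.
    """
    frac = (snr_db - lo_db) / (hi_db - lo_db)
    return min(1.0, max(0.0, frac))


def two_stream_log_likelihood(log_b_audio, log_b_visual, w_audio):
    """Combine per-stream state log-likelihoods with stream weights.

    Assumes the weights sum to 1, so the visual weight is 1 - w_audio.
    """
    w_visual = 1.0 - w_audio
    return w_audio * log_b_audio + w_visual * log_b_visual


if __name__ == "__main__":
    # Illustrative log-likelihood values only; not data from the report.
    for snr in (-3.0, 8.5, 20.0):
        w = acoustic_weight(snr)
        score = two_stream_log_likelihood(-10.0, -4.0, w)
        print(f"SNR {snr:5.1f} dB  w_audio={w:.2f}  combined={score:.2f}")
```

Under these assumptions, the visual stream dominates the score as the acoustic SNR drops, and the acoustic stream dominates in clean conditions.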
Document Details
- Document Type: Technical Report
- Publication Date: Jun 12, 2002
- Accession Number: ADP014020
Entities
People
- Michael T. Chan