Massively Parallel Network Architectures for Automatic Recognition of Visual Speech Signals

Abstract

This research sought to produce a massively parallel network architecture that could interpret speech signals from video recordings of human talkers. This report summarizes the project's results: (1) A corpus of video recordings from two human speakers was analyzed with image processing techniques and used as the data for this study; (2) We demonstrated that a feedforward network could be trained to categorize vowels from these talkers. Its performance was comparable to that of nearest-neighbor techniques and to that of trained humans on the same data; (3) We developed a novel approach to sensory fusion by training a network to transform facial images into short-time spectral amplitude envelopes. This information can be used to increase the signal-to-noise ratio, and hence the performance, of acoustic speech recognition systems in noisy environments; (4) We explored the use of recurrent networks to perform the same mapping for continuous speech. The results of this project demonstrate the feasibility of adding a visual speech recognition component to enhance existing speech recognition systems. Such a combined system could be used in noisy environments, such as cockpits, where improved communication is needed. This demonstration of presymbolic fusion of visual and acoustic speech signals is consistent with our current understanding of human speech perception.
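To illustrate why a visually derived spectral envelope could raise the effective signal-to-noise ratio, the following is a minimal sketch and not the report's method. It assumes that the acoustic and visual estimates of the envelope carry independent Gaussian errors with known variances, and it combines them by inverse-variance weighting. The abstract does not specify a fusion rule; the weighting scheme, the function names `fuse` and `mse`, and all numeric values are hypothetical and the data are synthetic.

```python
import random


def fuse(acoustic, visual, var_acoustic, var_visual):
    """Inverse-variance weighted average of two envelope estimates.

    Assumption: both estimates are unbiased with independent errors.
    """
    w_a = 1.0 / var_acoustic
    w_v = 1.0 / var_visual
    return [(w_a * a + w_v * v) / (w_a + w_v) for a, v in zip(acoustic, visual)]


def mse(estimate, target):
    """Mean squared error between an estimate and the clean target."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


# Synthetic stand-in for spectral envelope amplitudes (hypothetical values).
rng = random.Random(0)
n_values = 6400
clean = [rng.uniform(0.0, 10.0) for _ in range(n_values)]

# Hypothetical error levels: a noisy microphone channel and an imperfect
# image-to-envelope estimate.
SIGMA_A, SIGMA_V = 1.0, 0.5
acoustic = [c + rng.gauss(0.0, SIGMA_A) for c in clean]
visual = [c + rng.gauss(0.0, SIGMA_V) for c in clean]

fused = fuse(acoustic, visual, SIGMA_A ** 2, SIGMA_V ** 2)

print("acoustic MSE:", mse(acoustic, clean))
print("visual MSE:  ", mse(visual, clean))
print("fused MSE:   ", mse(fused, clean))
```

Under these assumptions the fused estimate has a lower expected error than either input, which is the sense in which an independent visual channel can help a recognizer in noise.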

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 1990
Accession Number: ADA226968

Entities

People

Moise Goldstein
Terrence J. Sejnowski

Organizations

Johns Hopkins University
