Multi-Modal Sensory Fusion with Application to Audio-Visual Speech Recognition

Abstract

In this work we consider the bimodal fusion problem in audio-visual speech recognition. A novel sensory fusion architecture based on coupled hidden Markov models (CHMMs) is presented. CHMMs are directed graphical models of stochastic processes and are a special type of dynamic Bayesian network. The proposed fusion architecture allows us to address the statistical modeling and the fusion of audio-visual speech in a unified framework. Furthermore, the architecture is capable of capturing the asynchronous and temporal inter-modal dependencies between the two information channels. We describe a model transformation strategy to facilitate inference and learning in CHMMs. Results from audio-visual speech recognition experiments confirm the superior capability of the proposed fusion architecture.
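
The abstract does not spell out the model transformation, so the following is only an illustrative sketch of one common way to make inference in a two-chain CHMM tractable: collapsing the audio and visual chains into a single HMM over joint states (a, v) and running the ordinary forward algorithm on it. This may differ from the authors' strategy. All names (chmm_to_product_hmm, forward_likelihood), the discrete observation model, and the toy parameters are assumptions made for illustration and do not come from the report.

```python
from itertools import product


def chmm_to_product_hmm(init_a, init_v, trans_a, trans_v):
    """Collapse a two-chain CHMM into one HMM over joint states (a, v).

    Assumed (hypothetical) parameterization:
      trans_a[a][v][a2] = P(a_t = a2 | a_{t-1} = a, v_{t-1} = v)
      trans_v[a][v][v2] = P(v_t = v2 | a_{t-1} = a, v_{t-1} = v)
    """
    states = list(product(range(len(init_a)), range(len(init_v))))
    init = [init_a[a] * init_v[v] for a, v in states]
    # Each chain's next state depends on BOTH previous states (the coupling),
    # so the joint transition factorizes into the two conditional terms.
    trans = [
        [trans_a[a][v][a2] * trans_v[a][v][v2] for a2, v2 in states]
        for a, v in states
    ]
    return states, init, trans


def forward_likelihood(states, init, trans, emit_a, emit_v, obs_a, obs_v):
    """Standard HMM forward pass on the joint model.

    emit_a[a][o] and emit_v[v][o] are discrete emission tables; the joint
    emission is their product because each stream depends on its own chain.
    """
    n = len(states)

    def emit(idx, t):
        a, v = states[idx]
        return emit_a[a][obs_a[t]] * emit_v[v][obs_v[t]]

    alpha = [init[i] * emit(i, 0) for i in range(n)]
    for t in range(1, len(obs_a)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit(j, t)
            for j in range(n)
        ]
    return sum(alpha)


# Toy parameters (made up): 2 audio states, 2 visual states, binary symbols.
init_a, init_v = [0.6, 0.4], [0.5, 0.5]
trans_a = [[[0.7, 0.3], [0.6, 0.4]], [[0.2, 0.8], [0.1, 0.9]]]
trans_v = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.4, 0.6]]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]
emit_v = [[0.7, 0.3], [0.4, 0.6]]

states, init, trans = chmm_to_product_hmm(init_a, init_v, trans_a, trans_v)
likelihood = forward_likelihood(
    states, init, trans, emit_a, emit_v, obs_a=[0, 1, 1], obs_v=[0, 0, 1]
)
print(likelihood)
```

The joint state space grows as the product of the per-chain state counts, which is the usual cost of this kind of transformation; it is acceptable for two small chains but is one reason more structured inference schemes are often preferred for larger models.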

Document Details

Document Type
Technical Report
Publication Date
Jun 12, 2002
Accession Number
ADP014018

Entities

People

  • Stephen M. Chu
  • Thomas Huang

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Automated Speech Recognition
  • Bayesian Networks
  • Engineering
  • Hidden Markov Models
  • Lead Time
  • Learning
  • Markov Models
  • Models
  • Noise
  • Observation
  • Probabilistic Models
  • Probability
  • Recognition
  • Stochastic Processes
  • Technical Information Centers
  • Universities

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Neural Network Machine Learning

Technology Areas

  • AI & ML