Temporal Asynchronicity Modeling by Product HMMs for Audio-Visual Speech Recognition

Abstract

Demand has grown recently for Automatic Speech Recognition (ASR) systems that operate robustly in acoustically noisy environments. This paper proposes a method for effectively integrating audio and visual information in audio-visual (bi-modal) ASR systems. Such integration requires modeling both the synchronicity and the asynchronicity of the audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on a family of product HMMs. The proposed model can represent state synchronicity not only within a phoneme but also between phonemes. We also propose a rapid stream weight optimization based on the GPD algorithm for noisy bi-modal speech recognition. Evaluation experiments show that the proposed method improves recognition accuracy for noisy speech. At an SNR of 0 dB, the proposed method attained 16% higher performance than product HMMs without the synchronicity re-estimation.
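
As a rough illustration of the ideas named in the abstract, the sketch below builds the composite state space of a product HMM from one audio HMM and one visual HMM, and combines per-stream log-likelihoods with a stream weight. It is not the report's implementation. The function names are hypothetical. The Kronecker-product transition matrix assumes the two streams move independently, which corresponds to a product HMM before any synchronicity re-estimation. The weighted-sum form with weights adding up to 1 is a common convention assumed here, not a detail taken from the report.

```python
import itertools
import numpy as np


def product_states(n_audio, n_visual):
    """Enumerate composite states (audio_state, visual_state)."""
    return list(itertools.product(range(n_audio), range(n_visual)))


def product_transitions(A_audio, A_visual):
    """Composite transition matrix under an independence assumption.

    Entry [(i, j), (k, l)] equals A_audio[i, k] * A_visual[j, l], with
    composite states ordered as returned by product_states.
    """
    return np.kron(A_audio, A_visual)


def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Stream-weighted observation log-likelihood.

    Assumes the audio and visual weights sum to 1 (an assumption of
    this sketch).
    """
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


if __name__ == "__main__":
    # Toy 2-state left-to-right HMMs for each modality (made-up values).
    A_audio = np.array([[0.6, 0.4], [0.0, 1.0]])
    A_visual = np.array([[0.7, 0.3], [0.0, 1.0]])
    A = product_transitions(A_audio, A_visual)
    for state, row in zip(product_states(2, 2), A):
        print(state, row)
```

Composite states such as (0, 1) or (1, 0) are the ones in which the audio and visual streams occupy different states, which is how a product HMM can tolerate a time lag between speech and lip movement. In a noisy setting, lowering `audio_weight` shifts the combined score toward the visual stream.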

Document Details

Document Type
Technical Report
Publication Date
Jun 12, 2002
Accession Number
ADP014022

Entities

People

  • Satoshi Nakamura

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Automated Speech Recognition
  • Boundaries
  • Data Displays
  • Databases
  • Environment
  • Gray Scale
  • Images
  • Optimization
  • Recognition
  • Standards
  • Test And Evaluation
  • Training
  • Transitions
  • Two Dimensional
  • Workshops

Fields of Study

  • Computer science

Readers

  • Adaptive Control and Estimation with Uncertainty in Dynamic Systems
  • Speech Processing/Speech Recognition
  • Systems Analysis and Design

Technology Areas

  • AI & ML