Visual Speech Feature Extraction From Natural Speech for Multi-modal ASR

Abstract

Adding visual information to improve the accuracy of speech recognition is the central approach in multi-modal ASR research. In this work, we address two important issues: lip tracking and visual speech feature extraction. To apply multi-modal ASR to natural speech, the visual front-end algorithm must extract visual speech features that are invariant to affine transformations and lighting conditions. This paper focuses on both a lip tracking algorithm that uses the Bayesian framework and a novel pixel-based visual speech feature extraction algorithm based on kurtosis measures of the frequency profile of local image blocks.
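
To make the idea of block-wise kurtosis features concrete, the following is a minimal sketch and not the report's actual algorithm, whose details the abstract does not give. It assumes a grayscale mouth-region image, non-overlapping 8x8 blocks, a 2-D FFT magnitude as the "frequency profile", and removal of the DC term; the function name `block_kurtosis_features` and all of these choices are illustrative assumptions.

```python
import numpy as np

def block_kurtosis_features(image, block=8):
    """Kurtosis of the FFT magnitude spectrum for each block x block tile.

    Hypothetical illustration: block size, FFT (rather than another
    transform), and dropping the DC term are assumptions, not details
    taken from the report.
    """
    img = np.asarray(image, dtype=np.float64)
    rows, cols = img.shape[0] // block, img.shape[1] // block
    # Crop so the image divides evenly into tiles.
    img = img[:rows * block, :cols * block]
    # Rearrange into an array of tiles with shape (rows, cols, block, block).
    tiles = img.reshape(rows, block, cols, block).swapaxes(1, 2)
    # Magnitude spectrum of every tile.
    spectrum = np.abs(np.fft.fft2(tiles, axes=(-2, -1)))
    # Flatten each spectrum and drop the DC coefficient (index 0), so a
    # constant brightness offset does not affect the statistic.
    flat = spectrum.reshape(rows, cols, -1)[..., 1:]
    centered = flat - flat.mean(axis=-1, keepdims=True)
    var = (centered ** 2).mean(axis=-1)
    m4 = (centered ** 4).mean(axis=-1)
    # Non-excess kurtosis: fourth central moment over squared variance.
    # The small epsilon guards against division by zero on flat tiles.
    return m4 / (var ** 2 + 1e-12)

# Usage: one feature per tile of a synthetic 32x40 image.
rng = np.random.default_rng(0)
demo = rng.random((32, 40))
features = block_kurtosis_features(demo, block=8)
```

Because kurtosis is a ratio of moments, multiplying the image by a constant gain leaves the feature essentially unchanged, and dropping the DC term removes the effect of a constant brightness offset. This illustrates one simple form of lighting invariance; it does not address the affine invariance the abstract also calls for.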

Document Details

Document Type
Technical Report
Publication Date
Jun 12, 2002
Accession Number
ADP014023

Entities

People

  • John N. Gowdy
  • Sabri Gurbuz

Organizations

  • Clemson University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Accuracy
  • Acoustic Signals
  • Algorithms
  • Artificial Intelligence
  • Automated Speech Recognition
  • Bayesian Networks
  • Data Sets
  • Electron Microscopes
  • Electron Microscopy
  • Feature Extraction
  • Image Processing
  • Machine Learning
  • Probability
  • Random Variables
  • Recognition
  • Test Sets
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval