Trainable Videorealistic Speech Animation

Abstract

I describe how to use machine learning techniques to create a generative, videorealistic speech animation module. A human subject is first recorded with a video camera as he or she utters a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data that can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence that contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2002
Accession Number
ADA456049

Entities

People

  • Tony F. Ezzat

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Abstracts
  • Cameras
  • Computer Science
  • Eye Movements
  • Information Operations
  • Instructions
  • Machine Learning
  • Sequences
  • Standards
  • Theoretical Computer Science
  • Trajectories
  • Video
  • Video Cameras

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments.
  • Computer Science/Computer Engineering/Data Science/Digital Signal Processing.
  • Speech Processing/Speech Recognition.

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks