Low-Bitrate Speech Compression with a Glottal Pulse Autoencoder

Abstract

This report describes a convolutional neural network with a custom voicing layer that compresses English speech to a rate of 1500 bits per second. The model is an autoencoder that encodes speech input into a quantized latent space and then decodes it using voice-like glottal sound and noise layers multiplied by a formant mask. Both the input and the output of the model are magnitude short-time Fourier transform spectrograms. Each 32-ms frame of the input maps approximately to one element of the bottleneck layer, which is quantized to 48 bits via 4 additive layers of 12-bit vector codebooks.
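
The bitrate follows from the numbers in the abstract: 4 codebook layers of 12 bits each give 48 bits per 32-ms frame, or 1500 bits per second. The sketch below checks that arithmetic and illustrates one common reading of "additive layers of vector codebooks", in which each layer quantizes the residual left by the previous one and the decoder sums the chosen codewords. It is an assumption-laden illustration, not the report's implementation: the latent dimension, the random untrained codebooks, and the function names are all hypothetical.

```python
import numpy as np

FRAME_MS = 32          # frame length, from the abstract
NUM_STAGES = 4         # additive codebook layers, from the abstract
BITS_PER_STAGE = 12    # 12-bit codebooks, i.e. 4096 entries each
LATENT_DIM = 16        # assumption: the abstract does not state this


def bits_per_second(num_stages=NUM_STAGES, bits_per_stage=BITS_PER_STAGE,
                    frame_ms=FRAME_MS):
    """Bitrate when one frame is coded with num_stages * bits_per_stage bits."""
    return num_stages * bits_per_stage * 1000 / frame_ms


def encode(latent, codebooks):
    """Greedy residual quantization: one codeword index per stage."""
    residual = latent.copy()
    indices = []
    for codebook in codebooks:
        # Squared distance from the current residual to every codeword.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices


def decode(indices, codebooks):
    """Reconstruct the latent vector as the sum of the selected codewords."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Hypothetical codebooks: random here, whereas a real system would learn them.
rng = np.random.default_rng(0)
codebooks = [
    rng.normal(scale=0.5 ** stage, size=(2 ** BITS_PER_STAGE, LATENT_DIM))
    for stage in range(NUM_STAGES)
]

latent = rng.normal(size=LATENT_DIM)      # stand-in for one bottleneck vector
indices = encode(latent, codebooks)       # 4 indices, 12 bits each
reconstruction = decode(indices, codebooks)

print(bits_per_second())                  # 4 * 12 * 1000 / 32
print(indices)
```

Only the 4 indices would need to be transmitted per frame; the decoder holds the same codebooks and rebuilds the latent vector by summation.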

Document Details

Document Type
Technical Report
Publication Date
Jan 17, 2024
Accession Number
AD1224751

Entities

People

  • Michael S. Lee

Organizations

  • United States Army Research Laboratory

Tags

Fields of Study

  • Computer science

Readers

  • Computer Programming and Software Development
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Space