Low-Bitrate Speech Compression with a Glottal Pulse Autoencoder

Abstract

This report describes a convolutional neural network with a custom voicing layer that compresses English speech to 1500 bits per second. The model is an autoencoder that converts input speech into a quantized latent space and then decodes it using voice-like glottal-pulse and noise layers multiplied by a formant mask. The model's input and output are magnitude short-time Fourier transform spectrograms. Each 32-ms input frame maps approximately to one element of the bottleneck layer and is quantized to 48 bits using four additive layers of 12-bit vector codebooks.
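The figures in the abstract are consistent with one another: four 12-bit codebook indices give 48 bits per frame, and 48 bits every 32 ms is 1500 bits per second. The sketch below checks that arithmetic and illustrates one common reading of "additive layers" of vector codebooks, namely greedy residual quantization, where each stage quantizes what the previous stages left over and the decoder sums the selected codewords. This reading is an assumption; the abstract does not say how the codebooks are searched or trained. The latent dimension `DIM`, the random codebooks, and all function names are hypothetical and stand in for values the report would supply.

```python
import random

FRAME_MS = 32          # frame duration stated in the abstract
STAGES = 4             # additive codebook layers stated in the abstract
BITS_PER_STAGE = 12    # each codebook has 2**12 = 4096 entries
DIM = 8                # hypothetical latent dimension (not given in the abstract)


def bitrate_bps(stages=STAGES, bits=BITS_PER_STAGE, frame_ms=FRAME_MS):
    """Bits per second when each frame carries stages * bits bits."""
    return stages * bits * 1000 / frame_ms


def make_codebooks(rng, stages=STAGES, bits=BITS_PER_STAGE, dim=DIM):
    """Random stand-in codebooks; a trained model would learn these."""
    books = []
    for s in range(stages):
        scale = 0.5 ** s  # assumption: later stages cover smaller residuals
        books.append([[rng.gauss(0.0, scale) for _ in range(dim)]
                      for _ in range(2 ** bits)])
    return books


def encode(vec, books):
    """Greedy residual quantization: pick one codeword index per stage."""
    residual = list(vec)
    indices = []
    for book in books:
        best = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        indices.append(best)
        # Pass the remaining error on to the next stage.
        residual = [r - c for r, c in zip(residual, book[best])]
    return indices


def decode(indices, books):
    """Reconstruct the latent vector by summing the selected codewords."""
    out = [0.0] * len(books[0][0])
    for idx, book in zip(indices, books):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


rng = random.Random(0)
books = make_codebooks(rng)
latent = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
codes = encode(latent, books)      # four integers in [0, 4095], 48 bits total
approx = decode(codes, books)      # additive reconstruction of the latent
print(bitrate_bps())               # 1500.0
```

Only the four indices would need to be transmitted per frame; the decoder holds the same codebooks and rebuilds the latent vector by addition before the glottal, noise, and formant-mask layers are applied.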


Document Details

Document Type: Technical Report
Publication Date: Jan 17, 2024
Accession Number: AD1224751

Entities

People

Michael S. Lee

Organizations

United States Army Research Laboratory
