Low-Bitrate Speech Compression with a Glottal Pulse Autoencoder
Abstract
In this report, a convolutional neural network with a custom voicing layer is used to compress English speech to a rate of 1500 bits per second. The model consists of an autoencoder that converts speech input into a quantized latent space and then decodes with voice-like glottal sound and noise layers multiplied by a formant mask. The input and output of the model are magnitude short-time Fourier transform spectrograms. Each 32-ms frame of the input is approximately mapped to an element of the bottleneck layer and quantized to 48 bits via 4 additive layers of 12-bit vector codebooks.
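The quoted rate follows from the quantization figures: 4 layers of 12 bits give 48 bits per frame, and one 48-bit code every 32 ms gives 1500 bits per second. The sketch below checks that arithmetic and illustrates one common way additive codebook layers can be searched, namely a greedy residual search in which each layer quantizes what the previous layers left over. It is an illustration under stated assumptions, not the report's implementation: the latent dimension, the random codebooks, the greedy search, and the names `encode` and `decode` are all hypothetical, and the rate calculation assumes exactly one code per 32-ms frame.

```python
import numpy as np

FRAME_MS = 32          # frame duration, from the abstract
NUM_LAYERS = 4         # additive codebook layers, from the abstract
BITS_PER_LAYER = 12    # 12-bit codebooks, i.e. 4096 entries per layer
LATENT_DIM = 16        # hypothetical: the abstract does not give the latent size

# Bitrate arithmetic: 4 * 12 = 48 bits per frame, 48 bits / 32 ms = 1500 bit/s.
bits_per_frame = NUM_LAYERS * BITS_PER_LAYER
bitrate = bits_per_frame * 1000 / FRAME_MS


def encode(latent, codebooks):
    """Greedy residual search (an assumption): each layer picks the entry
    nearest to whatever the previous layers have not yet explained."""
    residual = latent.copy()
    indices = []
    for codebook in codebooks:
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        indices.append(index)
        residual = residual - codebook[index]
    return indices


def decode(indices, codebooks):
    """Reconstruct the latent as the sum of one selected entry per layer."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Random stand-in codebooks; a trained model would learn these.
rng = np.random.default_rng(0)
codebooks = [
    rng.normal(scale=0.5 ** layer, size=(2 ** BITS_PER_LAYER, LATENT_DIM))
    for layer in range(NUM_LAYERS)
]

latent = rng.normal(size=LATENT_DIM)      # stand-in for one bottleneck vector
indices = encode(latent, codebooks)       # four 12-bit indices = 48 bits
reconstruction = decode(indices, codebooks)
```

Only the four indices need to be transmitted per frame; the decoder holds the same codebooks and sums the selected entries to recover an approximation of the latent vector.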
Document Details
- Document Type: Technical Report
- Publication Date: Jan 17, 2024
- Accession Number: AD1224751
Entities
People
- Michael S. Lee
Organizations
- United States Army Research Laboratory