Low-Bitrate Speech Compression with a Glottal Pulse Autoencoder

Abstract

This report describes a convolutional neural network with a custom voicing layer that compresses English speech to 1500 bits per second. The model is an autoencoder that converts input speech into a quantized latent space and then decodes it using voice-like glottal sound and noise layers multiplied by a formant mask. Both the input and the output of the model are magnitude short-time Fourier transform spectrograms. Each 32-ms frame of the input maps approximately to one element of the bottleneck layer, which is quantized to 48 bits via 4 additive layers of 12-bit vector codebooks.
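
The bit budget in the abstract can be checked with a short sketch: 4 codebook indices of 12 bits each give 48 bits per 32-ms frame, which works out to 1500 bits per second. The sketch below also illustrates one common way to read "additive layers" of vector codebooks, namely greedy residual quantization, where each layer quantizes what the previous layers left over and the decoder sums the chosen entries. This reading is an assumption and not necessarily the report's method. The latent dimension (LATENT_DIM), the random codebooks, and all function names are hypothetical stand-ins; a real system would use trained codebooks.

```python
import random

FRAME_MS = 32                         # frame length, from the abstract
NUM_LAYERS = 4                        # additive codebook layers, from the abstract
BITS_PER_LAYER = 12                   # bits per codebook index, from the abstract
CODEBOOK_SIZE = 2 ** BITS_PER_LAYER   # 4096 entries per codebook
LATENT_DIM = 8                        # hypothetical; not stated in the abstract


def bitrate_bps():
    """Bits per frame divided by frame duration, in bits per second."""
    return NUM_LAYERS * BITS_PER_LAYER * 1000 / FRAME_MS


def make_codebooks(seed=0):
    """Random stand-ins for trained codebooks (assumption for illustration).

    Later layers are drawn at a smaller scale, mimicking finer corrections.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.5 ** layer) for _ in range(LATENT_DIM)]
         for _ in range(CODEBOOK_SIZE)]
        for layer in range(NUM_LAYERS)
    ]


def nearest(codebook, vec):
    """Index of the codebook entry with the smallest squared distance to vec."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def encode(vec, codebooks):
    """Greedy residual quantization: one 12-bit index per layer."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        indices.append(index)
        # Subtract the chosen entry so the next layer codes the remainder.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices


def decode(indices, codebooks):
    """Sum the selected entry from each layer to rebuild the latent vector."""
    out = [0.0] * LATENT_DIM
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out
```

Under these assumptions, calling encode on one latent vector yields four integers in the range 0 to 4095, which together occupy the 48 bits allotted to a frame, and decode rebuilds an approximation of the vector from those four indices alone.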

Document Details

Document Type
Technical Report
Publication Date
Jan 16, 2024
Accession Number
AD1218935

Entities

People

  • Michael S. Lee

Organizations

  • United States Army Research Laboratory

Tags

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automated Speech Recognition
  • Coding
  • Communication Systems
  • Computational Science
  • Computer Programming
  • Computer Vision
  • Data Compression
  • Dimensionality Reduction
  • Frequency Bands
  • Intelligibility
  • Language
  • Neural Networks
  • Pattern Recognition
  • Signal Processing
  • Speech Compression
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Approximation Theory
  • Radio communications and signal processing
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Space