The Unsupervised Acquisition of a Lexicon from Continuous Speech

Abstract

We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in a minimum description length (MDL) framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 1995
Accession Number
ADA307187

Entities

People

  • Carl De Marcken

Organizations

  • Massachusetts Institute of Technology

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Automated Speech Recognition
  • Birds
  • Cognitive Science
  • Computational Science
  • Computer Programs
  • Engineering
  • Grammars
  • Information Theory
  • Language
  • Linguistics
  • Machine Learning
  • Machine Translation
  • Probability
  • United States

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation