Spontaneous Speech Collection for the CSR Corpus

Abstract

As part of a pilot data collection for DARPA's Continuous Speech Recognition (CSR) speech corpus, SRI International experimented with the collection of spontaneous speech material. The bulk of the CSR pilot data was read versions of news articles from the Wall Street Journal (WSJ), and the spontaneous sentences were to be similar material, but spontaneously dictated. In the first pilot portion of the data collection, twelve subjects, including nine journalists, were located and instructed in how to dictate using the data collection hardware and software at SRI. These talkers produced 1280 spontaneous sentences. In general, compared to read material, the spontaneous material took about two to three times more subject time to produce and about four times more experimenter time to produce, package, and ship. The paper provides details on the materials, subjects, and procedures used in the study, and it describes the results in terms of speaker reaction and data production. The methods described are sufficient to collect fluent spontaneous recordings at a predictable rate. The spontaneous material differs in several characteristics from WSJ material: paragraphs and sentences tend to be longer, more word types are used, and, by most measures, the material is more variable.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 1992
Accession Number: ADA457876

Entities

People

  • Denise Danielson
  • Jared Bernstein

Organizations

  • SRI International

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Automated Speech Recognition
  • Contracts
  • Feedback
  • Governments
  • Information Operations
  • Instructions
  • Materials
  • Newspapers
  • Pilot Studies
  • Production
  • Public Relations
  • Standards
  • User Interface
  • Vocabulary
  • Word Processors

Readers

  • Speech Processing/Speech Recognition
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval