Spontaneous Speech Collection for the CSR Corpus

Abstract

As part of a pilot data collection for DARPA's Continuous Speech Recognition (CSR) speech corpus, SRI International experimented with the collection of spontaneous speech material. The bulk of the CSR pilot data was read versions of news articles from the Wall Street Journal (WSJ), and the spontaneous sentences were to be similar material, but spontaneously dictated. In the first pilot portion of the data collection, twelve subjects, including nine journalists, were located and instructed in how to dictate using the data collection hardware and software at SRI. These talkers produced 1280 spontaneous sentences. In general, compared to read material, the spontaneous material took about two to three times more subject time to produce and about four times more experimenter time to produce, package, and ship. The paper provides details on the materials, subjects, and procedures used in the study, and it describes the results in terms of speaker reaction and data production. The methods described are sufficient to collect fluent spontaneous recordings at a predictable rate. The spontaneous material differs in several characteristics from WSJ material: paragraphs and sentences tend to be longer, more word types are used, and, by most measures, the material is more variable.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 1992
Accession Number: ADA457876

Entities

People

Denise Danielson
Jared Bernstein

Organizations

SRI International
