Web resources for language modeling in conversational speech recognition

Abstract

This article describes a methodology for collecting text from the Web to match a target sublanguage in both style (register) and topic. Unlike other work that estimates n-gram statistics from page counts, the approach here is to select and filter documents, which provides more control over the type of material contributing to the n-gram counts. The data can be used in a variety of ways; here, the different sources are combined in two types of mixture models. Focusing on conversational speech, where data collection can be quite costly, experiments demonstrate the positive impact of Web collections on several tasks with varying amounts of data, including Mandarin and English telephone conversations and English meetings and lectures.

Document Details

Document Type: Pub Defense Publication
Publication Date: Dec 01, 2007
Source ID: 10.1145/1322391.1322392

Entities

People

Andreas Stolcke
Ivan Bulyko
Manhung Siu
Mari Ostendorf
Tim Ng
Özgür Çetin

Organizations

BBN Technologies
Defense Advanced Research Projects Agency
Hong Kong University of Science and Technology
International Computer Science Institute
University of Washington
