Improving Trigram Language Modeling with the World Wide Web

Abstract

We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models significantly improve speech recognition word error rate on a small test set.
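
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the report's actual estimators, so everything here is an assumption: `page_count` is a hypothetical callable standing in for a search engine's phrase-query hit count, page counts are used directly as N-gram counts (the report estimates counts from them), the web estimate is a plain relative frequency, and the interpolation is a simple linear mix with a fixed weight `lam`.

```python
from typing import Callable


def web_trigram_estimate(
    w1: str, w2: str, w3: str, page_count: Callable[[str], int]
) -> float:
    """Relative-frequency estimate of P(w3 | w1 w2) from phrase-query hit counts.

    `page_count` is a hypothetical stand-in for a search engine query; it takes
    a quoted phrase and returns the number of pages reported for it.
    """
    bigram_hits = page_count(f'"{w1} {w2}"')
    if bigram_hits <= 0:
        # No evidence for the history on the web: the estimate is undefined,
        # so fall back to 0 and let the corpus model carry the weight.
        return 0.0
    trigram_hits = page_count(f'"{w1} {w2} {w3}"')
    # Reported hit counts can be noisy, so clamp the ratio into [0, 1].
    return min(1.0, max(0.0, trigram_hits / bigram_hits))


def interpolate(p_web: float, p_corpus: float, lam: float = 0.5) -> float:
    """Linear interpolation of web-based and corpus-based estimates.

    `lam` is an assumed fixed weight; in practice it would be tuned on
    held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_web + (1.0 - lam) * p_corpus


if __name__ == "__main__":
    # Made-up hit counts, used only to exercise the functions.
    fake_hits = {'"new york"': 200, '"new york city"': 50}
    p_web = web_trigram_estimate(
        "new", "york", "city", lambda q: fake_hits.get(q, 0)
    )
    print(interpolate(p_web, p_corpus=0.05, lam=0.5))
```

Passing the hit-count source as a callable keeps the sketch independent of any particular search engine and makes it easy to test with canned counts.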

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2000
Accession Number
ADA385124

Entities

People

  • Roni Rosenfeld
  • Xiaojin Zhu

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Automated Speech Recognition
  • Computer Science
  • Context Free Grammars
  • Data Sets
  • Frequency
  • Gaussian Distributions
  • Hypotheses
  • Interpolation
  • Language
  • Probability
  • Recognition
  • Reliability
  • Test Sets
  • Training
  • Word Lists
  • World Wide Web

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Database Systems and Applications
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation