Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

Abstract

We study methods for efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for token sequences in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of a category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to exploit the information contained in a massive text corpus with relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
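
To make the idea of correcting real-word errors from n-gram counts concrete, the following is a minimal sketch and not the report's algorithm, which the abstract does not describe. It assumes a tiny in-memory table of trigram counts with made-up numbers in place of the Web-scale database, and hypothetical confusion sets (`CONFUSION_SETS`) listing words that can substitute for each other. Each candidate is scored by summing the counts of the trigrams that cover its position, and the highest-scoring candidate is kept.

```python
from collections import Counter

# Toy trigram counts standing in for a Web-scale n-gram database.
# The numbers are invented for illustration only.
TRIGRAM_COUNTS = Counter({
    ("a", "piece", "of"): 900,
    ("a", "peace", "of"): 12,
    ("piece", "of", "cake"): 400,
    ("peace", "of", "cake"): 1,
    ("peace", "of", "mind"): 350,
    ("piece", "of", "mind"): 20,
})

# Hypothetical confusion sets: words that are plausible substitutes
# for one another. The original word is listed first so ties keep it.
CONFUSION_SETS = {
    "peace": ["peace", "piece"],
    "piece": ["piece", "peace"],
}


def context_score(tokens, i, candidate):
    """Sum the counts of all trigram windows covering position i
    after substituting `candidate` at that position."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    score = 0
    first = max(0, i - 2)
    last = min(i, len(trial) - 3)
    for start in range(first, last + 1):
        score += TRIGRAM_COUNTS[tuple(trial[start:start + 3])]
    return score


def correct(tokens):
    """Replace each token with the best-scoring member of its confusion set."""
    corrected = list(tokens)
    for i, word in enumerate(tokens):
        candidates = CONFUSION_SETS.get(word, [word])
        corrected[i] = max(
            candidates, key=lambda c: context_score(corrected, i, c)
        )
    return corrected


if __name__ == "__main__":
    print(" ".join(correct("a peace of cake".split())))
    print(" ".join(correct("peace of mind".split())))
```

With the toy counts above, "peace" in "a peace of cake" is replaced by "piece", while "peace of mind" is left unchanged. Handling non-word errors would additionally require generating candidates, for example by edit distance, which this sketch omits.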

Document Details

Document Type: Technical Report
Publication Date: May 01, 2008
Accession Number: ADA486165

Entities

People

  • Andrew Carlson
  • Ian Fette
  • Tom M. Mitchell

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • C4I
  • Cyber

DTIC Thesaurus Topics

  • Accuracy
  • Anti-Bacterial Agents
  • Bacteria
  • Birds
  • Computer Science
  • Data Analysis
  • Databases
  • Electronic Mail
  • Errors
  • Frequency
  • Fungi
  • Geography
  • Infection
  • Machine Learning
  • Natural Language Processing
  • Network Science
  • Probabilistic Models

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Educational Psychology