Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

Abstract

We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for sequences of tokens in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of some category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to use information contained in a massive corpus of text using relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
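
As an illustration of how n-gram frequency counts can drive real-word spelling correction, the following is a minimal sketch and not the report's own algorithm, which the abstract does not specify. The count table `NGRAM_COUNTS`, the helper names `context_score` and `correct_word`, and the idea of a fixed confusion set are all assumptions made for this example. A real system would query counts from a database built over a very large corpus rather than an in-memory dictionary.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy count table (made-up numbers, for illustration only).
NGRAM_COUNTS: Dict[Tuple[str, ...], int] = {
    ("a", "piece", "of"): 900,
    ("a", "peace", "of"): 12,
    ("piece", "of", "cake"): 700,
    ("peace", "of", "cake"): 3,
    ("world", "peace", "is"): 150,
    ("world", "piece", "is"): 1,
}


def context_score(tokens: Sequence[str], index: int, candidate: str, n: int = 3) -> int:
    """Sum the counts of every n-gram window covering `index`,
    with `candidate` substituted at that position."""
    replaced: List[str] = list(tokens)
    replaced[index] = candidate
    total = 0
    for start in range(index - n + 1, index + 1):
        # Skip windows that would run off either end of the sentence.
        if start < 0 or start + n > len(replaced):
            continue
        total += NGRAM_COUNTS.get(tuple(replaced[start:start + n]), 0)
    return total


def correct_word(tokens: Sequence[str], index: int, confusion_set: Sequence[str]) -> str:
    """Pick the member of `confusion_set` whose surrounding n-grams are most frequent."""
    return max(confusion_set, key=lambda c: context_score(tokens, index, c))


if __name__ == "__main__":
    sentence = ["a", "peace", "of", "cake"]
    print(correct_word(sentence, 1, ["peace", "piece"]))
```

The sketch scores each candidate only by raw counts of the windows that contain it. Smoothing, backoff to shorter n-grams, and candidate generation for non-word errors are left out.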

Document Details

Document Type: Technical Report
Publication Date: May 01, 2008
Accession Number: ADA486165

Entities

People

Andrew Carlson
Ian Fette
Tom M. Mitchell

Organizations

Carnegie Mellon University
