Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics
Abstract
We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for sequences of tokens in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of some category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to exploit the information contained in a massive corpus of text with relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
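To make the idea of count-based correction concrete, the following is a minimal sketch of choosing between candidate words by looking up n-gram frequency counts for the surrounding context. It is not the report's algorithm, which the abstract does not describe. The table `TRIGRAM_COUNTS`, its values, and the functions `context_score` and `best_candidate` are all hypothetical and stand in for a real Web-scale count database.

```python
from typing import Dict, List, Sequence, Tuple

Trigram = Tuple[str, str, str]

# Hypothetical toy counts standing in for a Web-scale n-gram database.
# The numbers are invented for illustration only.
TRIGRAM_COUNTS: Dict[Trigram, int] = {
    ("a", "piece", "of"): 900,
    ("a", "peace", "of"): 12,
    ("piece", "of", "cake"): 700,
    ("peace", "of", "cake"): 1,
    ("peace", "of", "mind"): 800,
    ("piece", "of", "mind"): 40,
}


def context_score(tokens: Sequence[str], i: int, candidate: str,
                  counts: Dict[Trigram, int]) -> int:
    """Sum the counts of every trigram covering position i
    when tokens[i] is replaced by candidate."""
    trial: List[str] = list(tokens)
    trial[i] = candidate
    total = 0
    # Trigram start positions whose window [start, start + 2] contains i.
    first = max(0, i - 2)
    last = min(i, len(trial) - 3)
    for start in range(first, last + 1):
        total += counts.get(tuple(trial[start:start + 3]), 0)
    return total


def best_candidate(tokens: Sequence[str], i: int, candidates: Sequence[str],
                   counts: Dict[Trigram, int] = TRIGRAM_COUNTS) -> str:
    """Return the candidate whose surrounding trigrams are most frequent.
    Ties go to the candidate listed first."""
    return max(candidates, key=lambda c: context_score(tokens, i, c, counts))


if __name__ == "__main__":
    sentence = ["a", "peace", "of", "cake"]
    print(best_candidate(sentence, 1, ["peace", "piece"]))
```

Because the candidate set can include the original word itself, the same lookup applies to real-word errors (a valid word used in the wrong context) as well as non-word errors, provided some separate step proposes the candidates.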
Document Details
- Document Type: Technical Report
- Publication Date: May 01, 2008
- Accession Number: ADA486165
Entities
People
- Andrew Carlson
- Ian Fette
- Tom M. Mitchell
Organizations
- Carnegie Mellon University