Lucene for n-grams using the ClueWeb Collection
Abstract
The ARSC team modified the Apache Lucene engine to accommodate "go words" taken from the Google Gigaword vocabulary of n-grams. The Category "B" subset of the ClueWeb collection was indexed with a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.
Document Details
- Document Type: Technical Report
- Publication Date: Nov 01, 2009
- Accession Number: ADA517732
Entities
People
- Chris Fallen
- Gregory B. Newby
- Kylie McCormick
Organizations
- University of Alaska Anchorage