Lucene for n-grams using the ClueWeb Collection

Abstract

The ARSC team modified the Apache Lucene engine to accommodate "go words," taken from the Google Gigaword vocabulary of n-grams. Indexing the Category "B" subset of the ClueWeb collection was accomplished by a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.
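
The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea, assuming that "go words" act as an allow-list (the inverse of a stop-word list) and that one index is built per n-gram order. It uses plain Python rather than Lucene, and the names `ngrams`, `build_index`, `docs`, and `go_words` are hypothetical, as are the toy documents and vocabularies.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, go_words, n):
    """Build an inverted index of n-grams, keeping only those in go_words.

    Assumption: an n-gram is indexed only if it appears in the allow-list.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            if gram in go_words:
                index[gram].add(doc_id)
    return index


# Toy documents and per-order go-word vocabularies (illustrative only).
docs = {
    "d1": "apache lucene indexes web pages",
    "d2": "lucene indexes n-grams from web pages",
}
go_words = {
    1: {"lucene", "web"},
    2: {"web pages", "lucene indexes"},
    3: {"apache lucene indexes"},
}

# One independent index per n-gram order, so each can be built separately.
indexes = {n: build_index(docs, go_words[n], n) for n in (1, 2, 3)}
```

Because each order has its own vocabulary and its own index, the three builds are independent of one another, which is one plausible reading of the divide-and-conquer approach the abstract mentions.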

Document Details

Document Type
Technical Report
Publication Date
Nov 01, 2009
Accession Number
ADA517732

Entities

People

  • Chris Fallen
  • Gregory B. Newby
  • Kylie McCormick

Organizations

  • University of Alaska Anchorage

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Arctic Regions
  • Availability
  • Classification
  • Contracts
  • Deficiencies
  • Information Operations
  • Instructions
  • Maryland
  • Monitoring
  • Regions
  • Security
  • Standards
  • Universities
  • Vocabulary

Readers

  • Computational Linguistics
  • Database Systems and Applications
  • Military/Explosive Ordnance Disposal (EOD) Technology