Data Mining of Extremely Large Ad Hoc Data Sets to Produce Inverted Indices

Abstract

The purpose of this study is to leverage existing Internet-sized ad hoc data sets by creating an inverted index that will enable a robust search capability. In particular, this study is focused on the Common Crawl web corpus. This involves exploring the tools and techniques necessary to effectively traverse this data set, as well as producing the tools to create an inverted index relationship between the terms and websites found within web archive files. The primary tools utilized in this process are Apache Hadoop, Apache MapReduce, Amazon Web Services, and Java. Additionally, methods to enhance this relationship with other information of interest are investigated in this thesis. Specifically, an index was developed that contains the added component of term relative location. This inverted index relationship is an essential component of, and the first step in, creating a robust search capability for a very large ad hoc data set.
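
As a rough illustration of the term-to-website relationship described in the abstract, the following is a minimal single-process sketch in Java. It is not the thesis's Hadoop MapReduce implementation. The class name `InvertedIndexSketch`, the `build` method, the example URLs, and the choice to treat "term relative location" as a zero-based token offset within each page are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; the thesis itself uses Hadoop MapReduce over Common Crawl.
public class InvertedIndexSketch {

    // Builds: term -> (URL -> zero-based token positions of that term on the page).
    // Treating "relative location" as a token offset is an assumption.
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> pages) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            // Lowercase and split on anything that is not an ASCII letter or digit.
            String[] tokens = page.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            int position = 0;
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split can yield a leading empty token
                }
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .computeIfAbsent(page.getKey(), u -> new ArrayList<>())
                     .add(position);
                position++;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up page contents standing in for text extracted from web archive files.
        Map<String, String> pages = new TreeMap<>();
        pages.put("http://example.com/a", "Data mining of large data sets");
        pages.put("http://example.com/b", "Mining the web");
        build(pages).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

In a MapReduce setting, the inner loop would roughly correspond to a map step emitting (term, URL and position) pairs, with the grouping by term handled by the reduce step; that mapping is a general pattern and not a description of the thesis's actual job design.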

Document Details

Document Type: Technical Report
Publication Date: Jun 01, 2016
Accession Number: AD1026303

Entities

People

Aaron D. Coudray

Organizations

Naval Postgraduate School
