Data Mining of Extremely Large Ad Hoc Data Sets to Produce Inverted Indices

Abstract

The purpose of this study is to leverage existing Internet-sized ad hoc data sets by creating an inverted index that will enable a robust search capability. In particular, this study is focused on the Common Crawl web corpus. This involves exploring the tools and techniques necessary to effectively traverse this data set, as well as producing the tools to create an inverted index relationship between the terms and websites found within web archive files. The primary tools utilized in this process are Apache Hadoop, Apache MapReduce, Amazon Web Services, and Java. Additionally, methods to enhance this relationship with other information of interest are investigated in this thesis. Specifically, an index was developed that contains the added component of term relative location. This inverted index relationship is an essential component of, and the first step in, creating a robust search capability for a very large ad hoc data set.
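
As an illustration of the kind of structure the abstract describes, the following is a minimal in-memory sketch of an inverted index that also records where each term occurs. It is not the thesis's implementation: it uses plain Java collections in place of Hadoop MapReduce, the class and method names (PositionalIndexSketch, termPositions, buildIndex) are hypothetical, the example URLs are placeholders, and "relative location" is assumed here to mean the zero-based token position of a term within a page, which may differ from the definition used in the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionalIndexSketch {

    // Roughly the "map" side: for one page, collect each term with the
    // zero-based token positions at which it occurs (assumed meaning of
    // "relative location").
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return positions;
    }

    // Roughly the "reduce" side: group the per-page results by term, giving
    // term -> (page URL -> positions of the term in that page).
    static Map<String, Map<String, List<Integer>>> buildIndex(Map<String, String> pages) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            Map<String, List<Integer>> perPage = termPositions(page.getValue());
            for (Map.Entry<String, List<Integer>> entry : perPage.entrySet()) {
                index.computeIfAbsent(entry.getKey(), k -> new TreeMap<>())
                     .put(page.getKey(), entry.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Placeholder pages standing in for text extracted from web archive files.
        Map<String, String> pages = new TreeMap<>();
        pages.put("http://example.com/a", "Big data needs big indexes");
        pages.put("http://example.com/b", "Data mining at scale");
        System.out.println(buildIndex(pages));
    }
}
```

In a MapReduce setting, the per-page step would plausibly run in mappers keyed by term and the grouping step in reducers, but that mapping is an assumption and is not taken from the report.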

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2016
Accession Number
AD1026303

Entities

People

  • Aaron D. Coudray

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Application Software
  • Big Data
  • Computer Programming
  • Computer Programs
  • Computers
  • Data Analysis
  • Data Mining
  • Data Sets
  • Domain Specific Programming Languages
  • HTML
  • Markup Languages
  • Network Protocols
  • Operating Systems
  • Programming Languages
  • United States
  • Web Service
  • XML

Readers

  • Distributed Systems and Data Platform Development
  • Geospatial Intelligence and Artificial Intelligence Analytics
  • Regression Analysis

Technology Areas

  • AI & ML