Data Mining of Extremely Large Ad-Hoc Data Sets to Produce Reverse Web-Link Graphs

Abstract

Data mining can be a valuable tool, particularly in the acquisition of military intelligence. As the second study within a larger Naval Postgraduate School research project using Amazon Web Services (AWS), this thesis focuses on data mining on a very large data set (32 TB) with the open web crawler data set Common Crawl. Similar to previous studies, this research employs MapReduce (MR) for sorting and categorizing output value pairs. Our research, however, is the first to implement the basic Reverse Web-Link Graph (RWLG) algorithm as a search capability for web sites, with validation that it works correctly. A second goal is to extend the RWLG algorithm using a full Common Crawl archive as input for processing as a single MR job. To mitigate the out-of-memory error, we relate some environment variables with the Yet Another Resource Negotiator (YARN) architecture and provide some sample error tracking methods. As a further contribution, this study considers limitations associated with using AWS, which inform our recommendations for future work.
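
For readers unfamiliar with the technique, the following is a minimal in-memory sketch of the general reverse web-link graph idea expressed as map and reduce steps. It is not the thesis's implementation: the function names (map_page, reduce_links, reverse_web_link_graph), the use of Python's standard html.parser module, and the example URLs are all assumptions chosen for illustration, and a real job over Common Crawl would run on a cluster rather than in a single process.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_page(source_url, html):
    """Map step (hypothetical name): emit (target, source) per outgoing link."""
    parser = LinkExtractor()
    parser.feed(html)
    for target in parser.links:
        yield target, source_url


def reduce_links(target, sources):
    """Reduce step (hypothetical name): list the distinct pages linking to target."""
    return target, sorted(set(sources))


def reverse_web_link_graph(pages):
    """Simulate the shuffle phase by grouping mapper output by target URL."""
    grouped = defaultdict(list)
    for source_url, html in pages.items():
        for target, source in map_page(source_url, html):
            grouped[target].append(source)
    return dict(reduce_links(t, s) for t, s in grouped.items())


# Illustrative input only; these URLs are made up.
pages = {
    "http://a.example/": '<a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>'
                         '<a href="http://a.example/">a</a>',
}
graph = reverse_web_link_graph(pages)
```

Looking up a URL in the resulting dictionary returns the pages that link to it, which is the sense in which a reverse web-link graph can serve as a search capability for web sites.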

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2017
Accession Number
AD1045810

Entities

People

  • Tao-hsiang Chang

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Engineered Resilient Systems
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Big Data
  • Cloud Computing
  • Computer Networks
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Mining
  • Electronic Mail
  • Html
  • Internet
  • Network Protocols
  • Network Science
  • Operating Systems
  • Parallel Computing
  • Social Media
  • Websites

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Atmospheric Science/Meteorology
  • Information Retrieval

Technology Areas

  • AI & ML