Cluster Computing For Automated Network Analysis At Scale

Abstract

Conventional single-node packet analyzers cannot monitor network traffic at scale. In this thesis, elements of the Apache Hadoop ecosystem, including HBase, Spark, and MapReduce, are employed to analyze a large collection of network traffic. Limited analysis is conducted directly on packet capture next generation (pcapng) files stored on the Hadoop Distributed File System (HDFS) using MapReduce. Next, to allow repeated analysis of the same dataset without reading every source file in its entirety for each calculation, the pcapng files are parsed and relevant metadata is bulk loaded into HBase, a Not Only Structured Query Language (NoSQL) database that employs HDFS for parallelization. This NoSQL database is then accessed via Apache Spark, where pertinent data is loaded into DataFrames and additional analysis of the network traffic takes place. This research demonstrates the viability of custom, modular, automated analytics that employ open-source software to enable parallelization and conduct traffic analysis at scale.
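
The parse-then-analyze idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's code: the thesis uses MapReduce, HBase, and Spark, whereas this example uses only the Python standard library as a stand-in. It walks the blocks of a pcapng byte string, extracts per-packet metadata from Enhanced Packet Blocks, and aggregates it. The function names (iter_packet_metadata, bytes_per_interface, make_block, sample_capture) and the choice of metadata fields are assumptions made for illustration.

```python
import struct
from collections import Counter

SHB_TYPE = 0x0A0D0D0A          # Section Header Block
EPB_TYPE = 0x00000006          # Enhanced Packet Block
BYTE_ORDER_MAGIC = 0x1A2B3C4D  # stored in the SHB; reveals the section's endianness


def iter_packet_metadata(data):
    """Yield one metadata dict per Enhanced Packet Block in a pcapng byte string."""
    offset, endian = 0, "<"
    while offset + 12 <= len(data):
        # The SHB type reads the same in both byte orders, so it can be
        # recognized before the endianness of the section is known.
        if data[offset:offset + 4] == b"\x0a\x0d\x0d\x0a":
            (magic,) = struct.unpack_from("<I", data, offset + 8)
            endian = "<" if magic == BYTE_ORDER_MAGIC else ">"
        block_type, total_len = struct.unpack_from(endian + "II", data, offset)
        # Stop on truncated or malformed input instead of guessing.
        if total_len < 12 or total_len % 4 or offset + total_len > len(data):
            break
        if block_type == EPB_TYPE and total_len >= 32:
            iface, ts_hi, ts_lo, cap_len, orig_len = struct.unpack_from(
                endian + "5I", data, offset + 8)
            yield {
                "interface_id": iface,
                # Timestamp units are set per interface (microseconds by default).
                "timestamp": (ts_hi << 32) | ts_lo,
                "captured_len": cap_len,
                "original_len": orig_len,
            }
        offset += total_len


def bytes_per_interface(records):
    """Sum original packet lengths per interface (a stand-in for later analysis)."""
    totals = Counter()
    for rec in records:
        totals[rec["interface_id"]] += rec["original_len"]
    return totals


def make_block(block_type, body):
    """Hypothetical helper: wrap a body in little-endian pcapng block framing."""
    body += b"\x00" * (-len(body) % 4)  # pad the body to a 32-bit boundary
    total_len = len(body) + 12
    return (struct.pack("<II", block_type, total_len) + body
            + struct.pack("<I", total_len))


def sample_capture():
    """Hypothetical helper: build a tiny in-memory capture with two packets."""
    shb = make_block(SHB_TYPE, struct.pack("<IHHq", BYTE_ORDER_MAGIC, 1, 0, -1))
    epb1 = make_block(EPB_TYPE, struct.pack("<5I", 0, 0, 1000, 5, 5) + b"hello")
    epb2 = make_block(EPB_TYPE, struct.pack("<5I", 1, 0, 2000, 3, 60) + b"abc")
    return shb + epb1 + epb2
```

Calling bytes_per_interface(iter_packet_metadata(sample_capture())) runs the whole sketch on the synthetic capture. In a cluster setting, the extracted records would be written to a shared store once and queried repeatedly, which is the motivation the abstract gives for loading metadata into HBase rather than re-reading the source files.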

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2018
Accession Number
AD1059771

Entities

People

  • Benjamin J. Brida

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Engineered Resilient Systems

DTIC Thesaurus Topics

  • Algorithms
  • Big Data
  • Computer Network Security
  • Computer Programming
  • Computer Programs
  • Computers
  • Data Analysis
  • Data Management
  • Data Processing
  • Data Set
  • Data Sets
  • Data Storage Systems
  • Databases
  • Digital Data
  • Domain Specific Programming Languages
  • Ecosystems
  • Environment
  • Information Science
  • Language
  • Machine Learning
  • Network Protocols
  • Open Source Software
  • Operating Systems
  • Programming Languages
  • Standards
  • Supervised Machine Learning
  • Virtual Machines

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Distributed Systems and Data Platform Development