Parallel Processing with TreeClust

Abstract

Clustering data is one of the most common statistical and machine learning techniques for analyzing big data. Clustering can be particularly difficult when the data sets include categorical, missing, or noise variables. The tree clustering algorithm developed by Samuel Buttrey and Lyn Whitaker, as described in the December 2015 issue of The R Journal, seems to provide a solution to these problems, but it requires a large set of overhead computations. This issue is intensified when working with high-dimensional data because the extent of treeClusts overhead computations are based on the dimensions of the data. High performance computing (HPC) and parallel processing present a solution to this overhead computation burden, but treeClusts existing parallel processing method does not work on the Naval Postgraduate Schools HPC, the Hamming Supercomputer (HSC). Furthermore, correctly determining what HPC resources to use can be a difficult task. In this thesis, we present a new HSC-specific method for parallel processing data using the treeClust R package developed by Buttrey and Whitaker. Based on the results of our experiments, our method approximates the optimal resource HPC request, so that users realize the best run time when using treeClust on the HSC.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2017
Accession Number
AD1046886

Entities

People

  • I. Taylor Mckechnie

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Central Processing Units
  • Computations
  • Computer Architecture
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Sets
  • Graphics Processing Unit
  • High Performance Computing
  • Operating Systems
  • Parallel Computing
  • Parallel Processing
  • Shell Scripts
  • Smartphones
  • Supercomputers

Fields of Study

  • Computer science

Readers

  • Computational Fluid Dynamics (CFD)
  • Neural Network Machine Learning.
  • Regression Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms