Fast Algorithms on Imperfect, Heterogeneous, Distributed Data

Abstract

This project studies the design and implementation of several variants of nonnegative matrix factorization (NMF) for text, graph, and hybrid data analytics. It addresses challenges that include solving new data analytics problems and improving the scalability of existing NMF algorithms. Data is commonly represented by one of two types of matrix: a feature-data matrix or a similarity matrix. Previous work successfully applied standard NMF to feature-data matrices in areas such as text mining and image analysis, and Symmetric NMF (SymNMF) to similarity matrices in areas such as graph clustering and community detection. In this work, a divide-and-conquer strategy is applied to both methods, reducing the growth of their time complexity with respect to the reduced rank from cubic to linear and yielding the DC-NMF and HierSymNMF2 methods. Extensive experiments on large-scale real-world data show improved performance for both methods. Furthermore, NMF and SymNMF are combined into a single formulation, called Joint-NMF, for analyzing hybrid data that contains both text content and connection structure information. The project also developed open-source software called SmallK (smallk.github.io), which offers several variants of NMF for fast clustering and topic modeling.
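For readers unfamiliar with standard NMF on a feature-data matrix, the following is a minimal illustrative sketch. It is not the report's DC-NMF, HierSymNMF2, or Joint-NMF method and does not use the SmallK API. It assumes NumPy is available and uses the textbook multiplicative-update rule on a made-up 4 x 4 toy matrix. The function name nmf_multiplicative and all of its parameters are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=500, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (m x n) by W @ H with W, H >= 0.

    Uses multiplicative updates for the Frobenius-norm objective.
    This is a generic baseline, not the algorithm from the report.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a nonnegative ratio,
        # so W and H stay nonnegative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy feature-data matrix (assumed data): rows are features such as terms,
# columns are data items such as documents. Columns 0-1 and 2-3 form two groups.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H = nmf_multiplicative(X, k=2)

# Assign each column (data item) to the factor with the largest coefficient.
labels = H.argmax(axis=0)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("cluster labels:", labels)
print("relative approximation error:", rel_error)
```

Here the columns of W act as cluster or topic representatives and the columns of H give each data item's nonnegative weights on them, which is why the largest entry of each column of H can be read as a cluster assignment.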

Document Details

Document Type: Technical Report
Publication Date: Aug 01, 2018
Accession Number: AD1057538

Entities

People

  • Haesun Park

Organizations

  • Georgia Tech Research Corporation

Tags

Communities of Interest

  • Biomedical
  • Cyber
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Air Force
  • Air Force Research Laboratories
  • Algorithms
  • Artificial Intelligence
  • Big Data
  • Computer Programs
  • Computer Science
  • Computers
  • Data Analysis
  • Data Mining
  • Detection
  • Information Science
  • Network Science
  • Open Source Software
  • Social Media
  • Standards
  • Text Mining

Fields of Study

  • Computer science

Readers

  • Graph Algorithms and Convex Optimization
  • Neural Network Machine Learning