Fast Algorithms on Imperfect, Heterogeneous, Distributed Data

Abstract

This project studies the design and implementation of several variants of nonnegative matrix factorization (NMF) for text, graph, and hybrid data analytics. It addresses two challenges: solving new data analytics problems and improving the scalability of existing NMF algorithms. Data are commonly represented by one of two kinds of matrix: a feature-data matrix or a similarity matrix. Previous work successfully applied standard NMF to feature-data matrices in areas such as text mining and image analysis, and Symmetric NMF (SymNMF) to similarity matrices in areas such as graph clustering and community detection. In this work, a divide-and-conquer strategy is applied to both methods, reducing the growth of their time complexity with respect to the reduced low rank from cubic to linear. The resulting methods are called DC-NMF and HierSymNMF2. Extensive experiments on large-scale real-world data show improved performance for both methods. In addition, NMF and SymNMF are combined into a single formulation, Joint-NMF, for analyzing hybrid data that contains both text content and connection structure information. The project also produced an open source software package, SmallK (smallk.github.io), which offers several variants of NMF for fast clustering and topic modeling.
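As a point of reference for the standard NMF mentioned in the abstract, the following is a minimal sketch of factoring a nonnegative feature-data matrix X into nonnegative factors W and H so that X is approximately W H. It uses the classic multiplicative update rule for the Frobenius-norm objective, which is an assumption made here for brevity. It is not the DC-NMF, HierSymNMF2, or Joint-NMF method of the report, and it does not use the SmallK API. The function names, the toy matrix, and all parameter values are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (m x n) by W (m x k) @ H (k x n).

    Illustrative only: plain multiplicative updates, not the report's algorithms.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random start so the multiplicative updates can move.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor elementwise and keeps it nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def relative_error(X, W, H):
    """Frobenius-norm reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)


# Hypothetical toy feature-data matrix: 6 terms (rows) x 4 documents (columns).
X = np.array(
    [
        [3, 2, 0, 0],
        [2, 3, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 2, 3],
        [0, 0, 3, 2],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

W, H = nmf_multiplicative(X, k=2)
err = relative_error(X, W, H)

# One simple way to read clusters off the result: assign each document
# (column) to the factor with the largest coefficient in H.
clusters = H.argmax(axis=0)
```

In this reading, the columns of W act as topics over terms and each column of H gives a document's weights on those topics. The cost of each iteration grows with the rank k, which is the scalability issue the divide-and-conquer variants in the report are aimed at.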

Document Details

Document Type: Technical Report
Publication Date: Aug 01, 2018
Accession Number: AD1057538

Entities

People

Haesun Park

Organizations

Georgia Tech Research Corporation
