Fast Algorithms on Imperfect, Heterogeneous, Distributed Data
Abstract
This project studies the design and implementation of several variants of NMF for text, graph and hybrid data analytics. It addresses challenges including solving new data analytics problems and improving the scalability of existing NMF algorithms. There are two major types of matrix representation of data: feature-data matrix and similarity matrix. Previous work showed successful application of standard NMF for feature-data matrix to areas such as text mining and image analysis, and Symmetric NMF (SymNMF) for similarity matrix to areas such as graph clustering and community detection. In this work, a divide-and-conquer strategy is applied to both methods to improve their time complexity from cubic growth with respect to the reduced low rank to linear growth, resulting in DC-NMF and HierSymNMF2 methods. Extensive experiments on large scale real world data show improved performance of these two methods. Furthermore, in this work NMF and SymNMF are combined into one formulation called Joint-NMF, to analyze hybrid data that contains both text content and connection structure information. They developed an open source software called SmallK (smallk.github.io) which offers several variants of NMF for fast clustering and topic modeling.
Document Details
- Document Type
- Technical Report
- Publication Date
- Aug 01, 2018
- Accession Number
- AD1057538
Entities
People
- Haesun Park
Organizations
- Georgia Tech Research Corporation