Fully Automatic Cross-Associations

Abstract

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, and co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices, e.g., k-means clustering, METIS graph partitioning, SVD/PCA, and frequent itemset mining, require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and support. Choosing suitable values for such parameters is a challenging problem. Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimum. Our algorithm is parameter-free and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
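
To make the idea of "homogeneous rectangular intersections" concrete, the following is a minimal illustrative sketch, not the report's algorithm. It assumes a simplified cost in which each block is charged its size times the binary entropy of its density of ones; the cost of describing the groups themselves, which a full information-theoretic criterion would also need, is omitted here. The function names and the toy matrix are hypothetical.

```python
import math
from collections import defaultdict


def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # an all-zero or all-one block costs nothing to encode
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def block_code_cost(matrix, row_groups, col_groups):
    """Approximate bits needed to encode a binary matrix block by block.

    Assumption: each block costs (block size) * H(density of ones).
    row_groups[i] and col_groups[j] give the group label of row i / column j.
    """
    ones = defaultdict(int)
    size = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            size[key] += 1
            ones[key] += value
    return sum(n * binary_entropy(ones[key] / n) for key, n in size.items())


# Toy matrix with two obvious dense blocks on the diagonal.
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Grouping that matches the block structure: every block is homogeneous.
good = block_code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])

# Grouping that mixes the blocks: every block is half ones, half zeros.
bad = block_code_cost(matrix, [0, 1, 0, 1], [0, 1, 0, 1])

print(good, bad)
```

Under this simplified cost, the grouping that follows the block structure is cheaper to encode than the grouping that mixes it, which is the intuition behind preferring homogeneous blocks.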

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 2004
Accession Number
ADA459025

Entities

People

  • Christos Faloutsos
  • Deepayan Chakrabarti
  • Dharmendra S. Modha
  • Spiros Papadimitriou

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Artificial Intelligence
  • Automatic
  • Coding
  • Compression Ratio
  • Computer Programming
  • Computer Science
  • Cost Models
  • Data Compression
  • Data Mining
  • Information Retrieval
  • Information Science
  • Information Theory
  • Natural Language Processing
  • Probability Distributions
  • Social Networks

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Linear Algebra
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms