Sublinear estimation of entropy and information distances

Abstract

In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but are distributions, that is, points on a high-dimensional simplex. For distributions, natural measures are not ℓ p distances, but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Since the datasets involved are typically massive, these algorithms need to have only sublinear complexity in order to be feasible in practice.

Document Details

Document Type
Pub Defense Publication
Publication Date
Oct 01, 2009
Source ID
10.1145/1597036.1597038

Entities

People

  • Andrew Mcgregor
  • Sudipto Guha
  • Suresh Venkatasubramanian

Organizations

  • Division of Computing and Communication Foundations
  • Office of Naval Research
  • University of Massachusetts
  • University of Pennsylvania
  • University of Utah

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Mathematical Modeling and Probability Theory.
  • Regression Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms
  • Space