Sublinear estimation of entropy and information distances
Abstract
In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but are distributions, that is, points on a high-dimensional simplex. For distributions, natural measures are not ℓ p distances, but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Since the datasets involved are typically massive, these algorithms need to have only sublinear complexity in order to be feasible in practice.
Document Details
- Document Type
- Pub Defense Publication
- Publication Date
- Oct 01, 2009
- Source ID
- 10.1145/1597036.1597038
Entities
People
- Andrew Mcgregor
- Sudipto Guha
- Suresh Venkatasubramanian
Organizations
- Division of Computing and Communication Foundations
- Office of Naval Research
- University of Massachusetts
- University of Pennsylvania
- University of Utah