Getting More from Less: Optimal Estimation and Learning, For Sparse, High Dimensional, or Untrusted Data.
Abstract
Project Title:Getting More from Less: Optimal Estimation and Learning, for Sparse, High Dimensional, or Untrusted Data.Abstract for SOW:The goals of the proposed research are to develop computationally efficient and information theoretically optimal algorithms and estimators for a variety of fundamental problems. The proposed problems center on two theoreticallyrich and increasingly important broad challenges: 1) extracting accurate information from complex distributions given relatively sparse samples, and2) obtaining provably accurate estimation and learning algorithms that can be applied to datasets where some (unknown) portion of the data is biased, untrusted, or maliciously generated. The focus of the first portion of the proposed work is motivated by the many practical settings of interest for which theempirical distribution of the available data is a poor representation of the underlying distribution or phenomenon. Such settings are pervasive, and include natural images and language, as well as medical and genomic datasets???essentially any data drawn from heavy-tailed distributions, high dimensional distributions, or distributions with complex dependencies. In such settings, what properties of the underlying distribution can be accurately inferred? Building offrecent work of PI Valiant, the proposed research seeks to answer this question in a number of fundamental settings. These include settings where the dimensionality of the distribution in question is high in comparison to the dataset size, settings with data drawn from multiple distributions where the goal is to infer relationships between the distribution (even in the regime where there is too little data to learn the distributions), and settings involving distributions ofsequential or time-dependent data including those generated by Hidden Markov Models. The second portion of the proposed work is motivated by the observation that the vast majority of theoretical results in machine learning andstatistics assume that that the training data is a reasonably reliable reflection of the phenomena to be learned or estimated. Similarly, the majority of machine learning and statistical techniques used in practice are brittle to thepresence of large amounts of biased or malicious data. This portion of the proposed work seeks to develop algorithms for a number of basic estimation, learning, and optimization tasks, that are robust to the presence of significant fractions of biased, erroneous, arbitrary, or even adversarially generated data; the proposal offers two novel theoretical frameworks in which to investigate the computational and information theoretic properties of these questions. This work is likely to have implications for several important practical settings including the curation of large datasets constructed with the help of unreliable workers (e.g. crowd-sourced workers), recommendation systems built on feedback/opinions including those contributed by untrusted or malicious participants, and providing security/performance guarantees for machine learning systems that have been trained on datasets that may havesome fraction of data provided by unreliable devices, or malevolent parties.
Document Details
- Document Type
- DoD Grant Award
- Publication Date
- Jul 10, 2018
- Source ID
- N000141812295
Entities
People
- Gregory Valiant
Organizations
- Office of Naval Research
- Stanford University
- United States Navy