Statistical Learning for the Modern Datasets: Generalization Bounds and Near-Optimal Learning Algorithms

Abstract

During the last two to three decades, several methods developed within machine learning have been able to come close to (or, in some instances, exceed) human-level performanceÑthe holy grail of AIÑon certain specialized tasks. Most of these developments have occurred in the context of deep learning, which requires large amounts of training data and sometimes necessitates a bit of ÔartÕ to achieve the desired results. There is also the issue of machine learning methods breaking down, sometimes with catastrophic effects, when they are moved from the ideal ÔlabÕ settings to the real-world settings. It is against this backdrop that three research questions become paramount in developing machine learning methods that can be deployed in real-world settings, especially for mission-critical applications that require a certain level of robustness and that sometimes also have limited number of training data samples: (1) A theoretical characterization of the generalization error (aka, expected loss) incurred by a machine learning algorithm on the unseen data in the real world; (2) Determination of the absolutely smallest number of data samples beyond which no algorithm can adequately learn from training data and ensure small generalization error; and (3) Development of algorithms whose generalization errors approach the fundamental limits on sample complexity. And while the statistical learning framework has long played a central role in advancing our understanding of machine learning systems, there is an interest in looking afresh at the questions of generalization error bounds, fundamental limits, and near-optimal algorithms in the face of modern datasets that increasingly represent a ÔzooÕ. Indeed, the classical statistical learning works typically focused on centralized datasets that often had Euclidean geometry. In contrast, many of todayÕs and tomorrowÕs applications of machine learning involve non-Euclidean datasets that are non-centralized, with data often streaming at very high rates, some of which might be compromised due to either gross errors or actions of adversarial entities. Such modern datasets necessitate development of fundamentally new analytical tools and algorithmic techniques for statistical learning-based study of machine learning systems. It is in this regard that this project leverages tools from stochastic approximation, (centralized and distributed) optimization theory, concentration-of-measure literature, information theory, robust statistics, and tensor algebra to derive generalization error bounds, fundamental limits on sample complexity, and near-optimal learning algorithms for machine learning from modern datasets. The outcomes of this project are expected to not only advance the state-of-the-art in statistical learning theory, but they are also expected to lead to computationally efficient algorithms for machine learning that can be deployed in practical settings with the smallest number of training samples.

Document Details

Document Type
DoD Grant Award
Publication Date
Jun 25, 2021
Source ID
W911NF2110301

Entities

People

  • Waheed U. Bajwa

Organizations

  • Army Contracting Command
  • Rutgers University
  • United States Army

Tags

Fields of Study

  • Computer science

Readers

  • Adaptive Control and Estimation with Uncertainty in Dynamic Systems.
  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks