Foundations of Deep Learning: SGD and Beyond

Abstract

Project Abstract (Approved for Public Release)

The central question in modern machine learning theory is the generalization capability of overparameterized models trained by stochastic gradient descent (SGD). Overparameterization ensures that SGD can minimize the training loss; however, this expressiveness also ensures the existence of poor global minimizers that fail to generalize. This proposal will study which global minimizers are selected by SGD, whether they generalize, and whether we can improve generalization by modifying the training algorithm.

Recent work identifies the stochasticity of SGD as a key factor in explaining the generalization of overparameterized models. It is also known that generalization is greatly affected by the choice of hyper-parameters, including learning rate, batch size, momentum, label noise, and dropout. The focus of this proposal is understanding how the stochasticity of SGD regularizes training, how this process is affected by different hyper-parameters, and whether we can replace the implicit regularization of SGD with explicit regularization to further aid generalization.

Intellectual Merit. This proposal lays out a program to study the implicit regularization of SGD in deep learning in several thrusts of increasing generality.

Thrust 1: Implicit Regularization of SGD with Label Noise. Motivated by empirical studies that demonstrate the effectiveness of noise induced by noisy labels, we study the global implicit regularization of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss, where the regularizer measures the Lipschitzness of the network. We conjecture that controlling this regularizer directly controls generalization error and improves sample complexity and robustness.

Thrust 2: Implicit Regularization of General SGD. The implicit regularization of stochastic gradient descent is greatly affected by the noise distribution. This thrust will develop a general framework to capture the implicit regularization of SGD by studying the interaction of the noise distribution with the model architecture.

Thrust 3: Beyond SGD: Disentangling Optimization and Generalization. Achieving state-of-the-art performance with SGD requires carefully tuning hyper-parameters including momentum, batch size, learning rate, weight decay, dropout, and normalization layers. This thrust seeks to disentangle optimization and generalization, thereby simplifying both algorithm and regularizer design.

Broader Impacts. This proposal seeks to understand and capture this implicit regularization to pave the way for faster training algorithms that generalize with larger batch sizes and fewer data points. Disentangling optimization and generalization is critically important because it allows training algorithms that converge faster than SGD to generalize. Such explicit regularization may also allow for increased performance on small-data tasks and data-scarce regimes when SGD noise is not sufficient to ensure generalization and stronger regularization is needed. As this analysis is not model specific, it will impact all naval machine learning applications, with a particular focus on faster convergence, better generalization, and reduced sample com
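
For readers unfamiliar with the training procedure named in Thrust 1, the following is a minimal sketch of SGD with label noise. It is not the project's implementation: the linear model, squared loss, Gaussian noise, the function name label_noise_sgd, and all parameter values are assumptions chosen for illustration. It shows only the mechanics (fresh noise added to the minibatch labels at every step) and does not compute or demonstrate the regularizer discussed in the abstract.

```python
import numpy as np

def label_noise_sgd(X, y, lr=0.01, noise_std=0.1, batch_size=8, steps=2000, seed=0):
    """Minibatch SGD on squared loss with fresh label noise at every step.

    Hypothetical helper for illustration: a linear model and Gaussian
    label noise are assumptions, not details taken from the abstract.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Sample a minibatch without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        # Perturb the labels of this minibatch with independent noise.
        noisy_y = y[idx] + noise_std * rng.standard_normal(batch_size)
        # Gradient of 0.5 * mean squared residual with respect to w.
        residual = X[idx] @ w - noisy_y
        grad = X[idx].T @ residual / batch_size
        w -= lr * grad
    return w

# Overparameterized toy problem: more parameters (d) than data points (n),
# so many weight vectors fit the clean labels exactly.
rng = np.random.default_rng(1)
n, d = 20, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true

w = label_noise_sgd(X, y)
initial_mse = np.mean(y ** 2)            # loss at the zero initialization
train_mse = np.mean((X @ w - y) ** 2)    # loss on the clean labels after training
print("initial MSE:", initial_mse, "final MSE:", train_mse)
```

In this toy setting the iterates keep moving near the set of interpolating solutions because the label noise never vanishes; which of those solutions the iterates favor is the question the abstract addresses.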

Document Details

Document Type
DoD Grant Award
Publication Date
Aug 20, 2021
Source ID
N000142112775

Entities

People

  • Jason S. Lee

Organizations

  • Office of Naval Research
  • Trustees of Princeton University
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning.

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks