YIP-DREAMI: Dimension Reduction for Efficient Automated Machine Intelligence

Abstract

This proposal seeks to address new challenges of signal processing in the big data era. Data is becoming cheaper and easier to collect and store due to a range of technological advancements. The resulting big datasets appear throughout applications in science and technology, healthcare, government, and defense. These large collections of data pose a problem for traditional data processing algorithms, as they are likely to be highly heterogeneous, with data of many different types and on many different scales, with missing values and with corruptions. As the volume of data grows faster than the number of data analysts or theorists, we require automated methods to process and make sense of this data. Dimension reduction provides the key to exploiting large datasets. The key idea is that measurements of a complex object, such as a patient in a hospital, a respondent on a survey, or even an entire dataset, can be well described as simple functions of an underlying low dimensional latent vector. When we can identify these low dimensional latent vectors and these simple measurement functions, we can denoise and impute entries in messy datasets, reduce the dimensionality of feature vectors, and simplify any further data processing. The first part (Tasks 1 & 2) of this proposal develops new mathematical tools to reduce the dimension of big messy datasets. We will develop new theory to understand when and why low rank models work well, and theoretically justified methods with provable performance guarantees to estimate the parameters of a more general latent variable model when linear models do not suffice. We will devise efficient algorithms with theoretical guarantees to learn nonlinear functions of the original variable, focusing on elementwise or polynomial transformations of the low dimensional variable. The second part (Tasks 3 & 4) uses the dimension reduction techniques developed in the first two tasks to choose models and formulate theories to understand a given dataset.
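As a rough illustration of how a low rank model can impute entries in a messy dataset, the sketch below alternates between a truncated SVD and re-inserting the observed values. It is a generic textbook-style iteration, not the method proposed in Tasks 1 and 2; the function name `impute_low_rank`, the `mask` convention, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def impute_low_rank(X, mask, rank, n_iters=200):
    """Return a rank-`rank` estimate of X using only entries where mask is True.

    Hypothetical helper for illustration: iterative truncated-SVD imputation.
    """
    # Initialise the missing entries with the mean of each column's observed values.
    col_means = np.nanmean(np.where(mask, X, np.nan), axis=0)
    filled = np.where(mask, X, col_means)
    low_rank = filled
    for _ in range(n_iters):
        # Project the current completed matrix onto the set of rank-`rank` matrices.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep the observed entries fixed and update only the missing ones.
        filled = np.where(mask, X, low_rank)
    return low_rank

# Synthetic example (assumed data): a rank-2 matrix with about 20% of entries hidden.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) < 0.8
X_missing = np.where(observed, truth, np.nan)

estimate = impute_low_rank(X_missing, observed, rank=2)
hidden_error = np.linalg.norm((estimate - truth)[~observed]) / np.linalg.norm(truth[~observed])
print(f"relative error on hidden entries: {hidden_error:.3g}")
```

Each row of the left factor plays the role of the low dimensional latent vector for one object, and the right factor holds the linear measurement functions. The nonlinear and heterogeneous cases described above would need a different loss and model than this plain SVD.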
Task 3 uses dimensionality reduction for automated machine learning (AutoML): finding the best algorithm and hyperparameter setting for a given supervised learning problem. Our AutoML methods will use dimension reduction to localize similar datasets near each other in a low dimensional space using techniques developed in Tasks 1 and 2, so that nearness in this space predicts similar performance of machine learning methods. Here, low dimensional structure is the key to speed. A lower dimensional space leads to faster information acquisition (and hence, prediction). Task 4 seeks to learn a model for the data-generating process itself: to automate the work of theorists, rather than of data analysts. We will focus on PDE models for the physical world. The first step in our approach will be to model the experimental data using a simulation of the nominal PDE with an additional forcing term that will be learned. We plan to use the dimension reduction techniques developed in Tasks 1 and 2 to solve these optimization problems more efficiently. The second step is to view this forcing term as endogenous: we will learn a parameterized map from the state of the system to this forcing term. Here, we expect to rely on automated machine learning techniques developed in Task 3. If successful, this work will ease challenges of big data, through robust and scalable algorithms, and of small data, by reducing the dimension of the feature space. Moreover, it presents a principled, unified approach to multi-modal, multi-scale information integration, significantly enhancing DoD capabilities. Separately, these tasks develop new techniques for use in data cleaning, featurization, machine learning, and PDE modeling. Taken together, this work provides a new automated toolkit to derive insights from data.
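One way to picture the Task 3 idea is with a matrix of past results, where rows are datasets and columns are candidate models. The sketch below assumes that matrix is approximately low rank, places a new dataset in the latent space from a few evaluated models, and predicts the performance of the rest. This is an assumed linear model for illustration only, not the proposal's algorithm; `fit_model_factors`, `predict_new_dataset`, and the synthetic scores are hypothetical.

```python
import numpy as np

def fit_model_factors(P, rank):
    """Factor a (datasets x models) performance matrix as P ~ A @ B."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # one latent vector per training dataset (rows)
    B = Vt[:rank]               # one latent vector per model (columns)
    return A, B

def predict_new_dataset(B, observed_idx, observed_perf):
    """Locate a new dataset in the latent space, then predict every model's score."""
    # Least squares fit of the latent vector a to the few measured scores:
    # minimise || B[:, observed_idx].T @ a - observed_perf ||_2.
    a, *_ = np.linalg.lstsq(B[:, observed_idx].T, observed_perf, rcond=None)
    return a @ B

# Synthetic example (assumed data): 40 past datasets, 12 models, latent dimension 3.
rng = np.random.default_rng(0)
B_true = rng.normal(size=(3, 12))
P = rng.normal(size=(40, 3)) @ B_true
new_scores = rng.normal(size=3) @ B_true  # unknown in practice

_, B = fit_model_factors(P, rank=3)
tried = np.array([0, 3, 5, 8, 11])  # models actually run on the new dataset
predicted = predict_new_dataset(B, tried, new_scores[tried])
print("predicted best model:", int(np.argmax(predicted)))
```

The lower the latent dimension, the fewer models need to be run before the new dataset's position is pinned down, which is the sense in which low dimensional structure buys speed here.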

Document Details

Document Type
DoD Grant Award
Publication Date
Feb 06, 2023
Source ID
N000142312203

Entities

People

  • Madeleine Udell

Organizations

  • Office of Naval Research
  • Stanford University
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Calculus or Mathematical Analysis
  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • Space