Leveraging Foundational Models for Reliable Discovery and Inference

Abstract

Approved for Public Release. Reliable decision-making increasingly depends on high-quality data. One burden is that acquiring quality labels often involves laborious human annotation or slow and expensive scientific measurements. Generative AI and machine learning are becoming an appealing alternative, as sophisticated predictive techniques are used to quickly and cheaply produce large amounts of imputed data; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice calls into question the validity of downstream inferences.

This project will leverage foundational models to draw rigorous inferences about populations and to make new individual discoveries. This will be achieved by introducing broadly applicable inferential tools that can work with essentially any machine learning prediction or imputation model, or any generative AI model. Throughout, the project will address a major challenge to making this possible: how do we remove the bias introduced by the generative AI model?

Our project has four aims. The first aim will develop methods to effectively augment the sample size by using imputations from machine learning models while automatically correcting for biases. The promise is that the resulting inferences achieve specified target error probabilities and that generative AI boosts performance, i.e., that the resulting inferences are more powerful than if we had used only the real data. The second aim will develop methods for active statistical inference, whereby the machine learning model is used to select which observations to collect. Again, the promise is that AI-powered strategic data collection improves statistical accuracy, in the sense that it reduces the sample size required to answer statistical questions in a variety of domains. The third aim will develop methods for reliable and diverse discoveries (e.g., drugs), whereby the machine learning model is used to screen candidates and select those most likely to be interesting (e.g., active for a specific target molecule) before we measure this experimentally. The promise here is to do so with a high hit rate, so that we do not waste resources on costly follow-up investigations, and to prioritize diversity, so that the selected units do not all look the same (e.g., the compounds we identify are not too similar to each other or to known drugs). Finally, the fourth aim will develop methods for active drug discovery, whereby the machine learning model is used to select which drugs to evaluate next and to improve coverage of the chemical space.

AI and machine learning algorithms and tools are part of large and complex systems used by the Navy. These systems routinely evaluate concrete situations and inform important decisions and actions. When going from data to action, it is imperative to prevent hallucinations or, simply, biases. The tools from this proposal address such critical issues. In addition, the proposed effort on active drug discovery may speed up the drug discovery pipeline and prevent failures in it, and ultimately translate into improved health outcomes.
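To make the first aim concrete, the following is a minimal sketch of one well-known way to combine a small labeled sample with many model predictions while correcting for prediction bias, here for estimating a population mean. It is an illustration under stated assumptions, not the project's own method: the function name `bias_corrected_mean`, the synthetic data, and the normal-approximation interval are all hypothetical choices made for this example.

```python
import numpy as np


def bias_corrected_mean(y_labeled, pred_labeled, pred_unlabeled, z=1.96):
    """Estimate a population mean from predictions plus a small labeled set.

    Hypothetical helper for illustration. The average prediction on the
    large unlabeled set is shifted by the average residual (label minus
    prediction) measured on the labeled set, which removes the bias of
    the predictor. The interval uses a normal approximation, assuming
    the two samples are independent draws from the same population.
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    pred_labeled = np.asarray(pred_labeled, dtype=float)
    pred_unlabeled = np.asarray(pred_unlabeled, dtype=float)

    n = len(y_labeled)
    big_n = len(pred_unlabeled)

    # Residuals on the labeled data estimate the predictor's bias.
    residuals = y_labeled - pred_labeled
    estimate = pred_unlabeled.mean() + residuals.mean()

    # Variance of the sum of the two independent sample means.
    se = np.sqrt(pred_unlabeled.var(ddof=1) / big_n
                 + residuals.var(ddof=1) / n)
    return estimate, (estimate - z * se, estimate + z * se)


if __name__ == "__main__":
    # Synthetic example (assumed data): true mean 2.0, and a predictor
    # that is systematically too high by 0.5.
    rng = np.random.default_rng(0)
    y_lab = rng.normal(2.0, 1.0, size=500)
    y_unl = rng.normal(2.0, 1.0, size=50_000)
    f_lab = y_lab + 0.5 + rng.normal(0.0, 0.3, size=500)
    f_unl = y_unl + 0.5 + rng.normal(0.0, 0.3, size=50_000)

    est, (lo, hi) = bias_corrected_mean(y_lab, f_lab, f_unl)
    print("predictions only:", f_unl.mean())
    print("bias-corrected:  ", est, (lo, hi))
    print("labeled only:    ", y_lab.mean())
```

In this sketch the interval narrows relative to using the labeled data alone only when the predictions are accurate enough that the residuals vary less than the labels themselves; with a poor predictor the correction still removes bias but offers little or no gain in power.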

Document Details

Document Type
DoD Grant Award
Publication Date
Apr 11, 2024
Source ID
N000142412305

Entities

People

  • Emmanuel Candès

Organizations

  • Office of Naval Research
  • Stanford University
  • United States Navy

Tags

Readers

  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - DoD AI Strategy
  • AI & ML - Neural Networks
  • Space