Applying Machine Learning with Model Interpretability for Sequence-Based Protein Solubility Prediction

Abstract

Predicting protein solubility is essential not only for basic research on natural proteins but, increasingly, for the production and investigation of engineered or designed proteins, where experimental confirmation of the engineered properties hinges on the ability to produce the protein. Accurate predictions of protein solubility are therefore widely sought after by protein engineers. Here we present a new approach that uses an extreme gradient boosting (XGBoost) algorithm, fed by a variety of data sources including predicted solvent accessibility and secondary structure, among others, to predict the solubility of proteins. Our model achieves an overall accuracy of 72% on a standard hold-out test set, among the highest for sequence-based machine learning models. Critically, our system also yields information on the features important for the predictions, making use of explainable artificial intelligence to provide both local and global explainers. Using this information, we found that certain mono-, di-, and tri-peptides are strongly associated with solubility, as are metrics for protein disorder, relative solvent accessibility, and the frequency of certain secondary structures, each of which is derived from other prediction models. In our graphical user interface for the model, we use local explanations to help convey the reasoning behind the predictions and to suggest modifications. Our model's accuracy, paired with its interpretability, should allow for rapid prediction of protein solubility, in particular for proteins and protein families without reliable structural information. This should greatly enhance our ability to experimentally produce and investigate proteins designed by machine learning-guided approaches and other protein engineering strategies.
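
The abstract names mono-, di-, and tri-peptide content among the features associated with solubility. As an illustration only, the sketch below computes relative frequencies of overlapping 1-, 2-, and 3-residue windows from a sequence. The function names, the `k1_`/`k2_`/`k3_` key prefixes, and the choice to skip windows containing non-standard residues are assumptions made for this example; the report's actual feature pipeline is not described in this record.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-residue window.

    Windows containing a non-standard residue are skipped
    (an assumption made for this sketch).
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    total = len(valid)
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in Counter(valid).items()}


def peptide_features(sequence):
    """Combine mono-, di-, and tri-peptide frequencies into one dict.

    Keys are prefixed with the window size (hypothetical naming).
    """
    features = {}
    for k in (1, 2, 3):
        for kmer, freq in kmer_frequencies(sequence, k).items():
            features[f"k{k}_{kmer}"] = freq
    return features


# Hypothetical 10-residue sequence, used only to exercise the functions.
example = peptide_features("MKTAYIAKQR")
```

The result is a sparse dictionary holding only the windows that occur. To feed a tree-based model such as XGBoost, these entries would need to be mapped onto a fixed column order, with absent windows set to zero.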

Document Details

Document Type
Technical Report
Publication Date
Sep 16, 2022
Accession Number
AD1181125

Entities

People

  • Anthony P. Malanoski
  • Jerome A. Alvarez
  • Patricia M. Legler
  • Scott N. Dean

Organizations

  • United States Naval Research Laboratory

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Accuracy
  • Amino Acids
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Biomedical Information Systems
  • Buildings And Structures
  • Chemical Synthesis
  • Chemistry
  • Computational Biology
  • Computational Science
  • Deep Learning
  • Dimensionality Reduction
  • Diseases And Disorders
  • Engineering
  • Escherichia Coli
  • Graphical User Interface
  • Machine Learning
  • Neural Networks
  • Protein Engineering
  • Proteins
  • Standards
  • Supervised Machine Learning
  • Test Sets
  • User Interface

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Molecular and Cellular Biochemistry
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Biotechnology