Applying Machine Learning with Model Interpretability for Sequence-Based Protein Solubility Prediction

Abstract

The prediction of protein solubility is essential for basic research on natural proteins and, increasingly, for the production and investigation of engineered or designed proteins, where experimental confirmation of the engineered properties hinges on the ability to produce the protein. Accurate predictions of protein solubility are therefore widely sought after by protein engineers. Here we present a new approach that uses an extreme gradient boosting (XGBoost) algorithm, fed by a variety of data sources including predicted solvent accessibility and secondary structure, among others, to predict the solubility of proteins. On a standard hold-out test set, our model achieves an overall accuracy of 72%, among the highest for sequence-based machine learning models. Critically, our system also yields information on the features important for the predictions, making use of explainable artificial intelligence to provide both local and global explainers. Using this information, we found that certain mono-, di-, and tri-peptides are strongly associated with solubility, as are metrics for protein disorder, relative solvent accessibility, and the frequency of certain secondary structures, each of which is derived from other prediction models. In our graphical user interface for the model, we also use local explanations to help convey the reasoning behind individual predictions and to suggest modifications. Our model's accuracy, paired with its interpretability, should allow for rapid prediction of protein solubility, in particular for proteins and protein families without reliable structural information. This should greatly enhance our ability to experimentally produce and investigate proteins designed by machine learning-guided approaches and other protein engineering strategies.
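
As an illustration of the mono-, di-, and tri-peptide features mentioned in the abstract, the following is a minimal sketch of how such composition features might be computed from a sequence. It is not the report's actual feature pipeline: the function names `kmer_frequencies` and `composition_features`, the restriction to the 20 standard amino acids, and the normalization by the number of valid windows are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k):
    """Relative frequency of every possible k-mer over the standard residues.

    Windows containing a non-standard character are skipped (an assumption
    for this sketch, not a rule taken from the report).
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    frequencies = {}
    for residues in product(AMINO_ACIDS, repeat=k):
        kmer = "".join(residues)
        frequencies[kmer] = counts[kmer] / total if total else 0.0
    return frequencies


def composition_features(sequence):
    """Concatenate mono-, di-, and tri-peptide frequencies into one dict."""
    features = {}
    for k in (1, 2, 3):
        features.update(kmer_frequencies(sequence, k))
    return features


if __name__ == "__main__":
    example = composition_features("MKTAYIAKQR")  # hypothetical sequence
    print(len(example), example["K"], example["AK"])
```

A feature dictionary of this kind could be turned into a fixed-length vector (20 + 400 + 8000 entries) and combined with the predicted structural features before training a gradient-boosted classifier.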

Document Details

Document Type: Technical Report
Publication Date: Sep 16, 2022
Accession Number: AD1181125

Entities

People

Anthony P. Malanoski
Jerome A. Alvarez
Patricia M. Legler
Scott N. Dean

Organizations

United States Naval Research Laboratory
