Applying Machine Learning with Model Interpretability for Sequence-Based Protein Solubility Prediction
Abstract
The prediction of protein solubility is essential not only for basic research on natural proteins but, increasingly, for the production and investigation of engineered or designed proteins, where experimental confirmation of the engineered properties hinges on the ability to produce the protein. Accurate predictions of protein solubility are therefore widely sought after by protein engineers. Here we present a new approach that uses an extreme gradient boosting (XGBoost) algorithm, fed by a variety of data sources including predicted solvent accessibility and secondary structure, among others, to predict the solubility of proteins. Our model achieves a high level of performance on a standard hold-out test set, with an overall accuracy of 72%, among the highest for sequence-based machine learning models. Critically, our system also yields information on the features important for the predictions, making use of explainable artificial intelligence to provide both local and global explainers. Using this information, we found that certain mono-, di-, and tri-peptides are strongly associated with solubility, as are metrics for protein disorder, relative solvent accessibility, and the frequency of certain secondary structures, each of which is derived from other prediction models. In our graphical user interface for the model, we use local explanations to help convey the reasoning behind the predictions and to suggest modifications. Our model's accuracy, paired with its interpretability, should allow for rapid prediction of protein solubility, in particular for proteins and protein families without reliable structural information. This should greatly enhance our ability to experimentally produce and investigate proteins designed by machine learning-guided approaches and other protein engineering strategies.
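To make the mono-, di-, and tri-peptide features mentioned in the abstract concrete, the following is a minimal sketch of how such composition features might be computed from a raw sequence. It is not the report's actual feature pipeline: the function names `kmer_frequencies` and `peptide_features`, the restriction to the 20 standard residues, and the normalization by window count are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k):
    """Relative frequency of every possible k-mer in `sequence`.

    Windows containing a non-standard residue are skipped, and
    frequencies are normalized by the number of valid windows.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(AMINO_ACIDS, repeat=k)
    }


def peptide_features(sequence):
    """Concatenate mono-, di-, and tri-peptide frequencies into one dict.

    Keys are the peptides themselves, so lengths 1, 2, and 3 never collide.
    """
    features = {}
    for k in (1, 2, 3):
        features.update(kmer_frequencies(sequence, k))
    return features


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the report.
    feats = peptide_features("MKKLLA")
    print(len(feats), feats["K"], feats["KK"], feats["KLL"])
```

A fixed-length vector of this kind (20 + 400 + 8000 entries) could sit alongside predicted disorder, solvent accessibility, and secondary-structure summaries as input to a gradient-boosted classifier, with a feature-attribution method applied afterwards to obtain local and global explanations.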
Document Details
- Document Type: Technical Report
- Publication Date: Sep 16, 2022
- Accession Number: AD1181125
Entities
People
- Anthony P. Malanoski
- Jerome A. Alvarez
- Patricia M. Legler
- Scott N. Dean
Organizations
- United States Naval Research Laboratory