Controlled Generation of Protein Sequences from Tree-Preserving Embeddings and Generative Deep Learning Models
Abstract
Rapid development of novel biomolecular sequences, and of proteins in particular, is essential for fields ranging from drug development to food safety, and recent advances in generative deep learning techniques for sequence generation have demonstrated their potential value for these applications. However, progressing from simple exploratory data analysis to controlled generation of novel sequences that resemble those found in nature remains cumbersome. Dimensionality reduction, which embeds high-dimensional data in a low-dimensional space while preserving some level of structure in the data, is relevant to exploratory data analysis because it can enable useful clustering. Although embedding methods such as principal component analysis, multidimensional scaling, and stochastic neighbor embedding are popular in protein sequence analysis, their direct application to sequence datasets generally fails to separate clusters, and the clusters that do form are not in line with known attributes of the proteins being analyzed. In contrast, tree-preserving embeddings demonstrate remarkable performance, producing logical clusters that follow from dendrograms of sequence-wise distance. Here we developed a new approach that uses tree-preserving embeddings for controlled sampling in the sequence space learned by several different generative models. Without resorting to more complex models, the generated sequences conform to proteins with known attributes while maintaining high sequence similarity. This combined method requires no additional data sources and can handle a wide range of protein types, simplifying its use for generating new proteins and other biomolecular sequences.
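The link between sequence-wise distance, dendrograms, and clusters can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy sequences, the use of Hamming distance on pre-aligned equal-length strings, the choice of single linkage, and all function names are assumptions made for illustration only. The sketch computes the height at which each pair of sequences merges in a single-linkage dendrogram (the structure a tree-preserving embedding aims to retain) and then cuts the tree to obtain cluster labels.

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pairwise_distances(seqs):
    """Symmetric matrix of sequence-wise distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = hamming(seqs[i], seqs[j])
    return d

def single_linkage_merge_heights(d):
    """Height at which each pair joins in the single-linkage dendrogram.

    This equals the minimax path distance: the smallest possible value of
    the largest step along any path between the two items.
    """
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(u[i][k], u[k][j])
                if via_k < u[i][j]:
                    u[i][j] = via_k
    return u

def clusters_at(u, height):
    """Cut the dendrogram: items that merge at or below `height` share a label."""
    labels = [-1] * len(u)
    next_label = 0
    for i in range(len(u)):
        if labels[i] == -1:
            for j in range(len(u)):
                if u[i][j] <= height:
                    labels[j] = next_label
            next_label += 1
    return labels

# Hypothetical toy data: two families of pre-aligned 8-residue sequences.
toy_seqs = ["MKTAYIAK", "MKTAYIAR", "MKSAYIAK", "GLVWDPEQ", "GLVWDPEN"]
d = pairwise_distances(toy_seqs)
u = single_linkage_merge_heights(d)
labels = clusters_at(u, height=0.5)
print(labels)
```

In a setting like the one described in the abstract, labels of this kind could indicate which region of an embedding to sample from when generating new sequences; how that sampling is actually performed is not specified here.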
Document Details
- Document Type: Technical Report
- Publication Date: May 03, 2023
- Accession Number: AD1200590
Entities
People
- Jerome A. Alvarez
- Scott N. Dean
Organizations
- United States Naval Research Laboratory