Controlled Generation of Protein Sequences from Tree-Preserving Embeddings and Generative Deep Learning Models

Abstract

Rapid development of novel biomolecular sequences, and in particular proteins, is essential for a range of fields from drug development to food safety, and recent advances in generative deep learning techniques for sequence generation have demonstrated their large potential value for these applications. However, progression from simple exploratory data analysis to controlled generation of novel sequences that resemble those in nature remains cumbersome. Dimensionality reduction for exploratory data analysis, which embeds high-dimensional data in a low-dimensional space while preserving some level of structure in the data, can enable useful clustering. While methods including principal component analysis, multidimensional scaling, and stochastic neighbor embedding are popular in protein sequence analysis, their direct application to sequence datasets generally fails to separate clusters, and the clusters that do form do not appear in line with known attributes of the proteins being analyzed. In contrast, tree-preserving embeddings demonstrate remarkable performance for logical clustering that follows from dendrograms showing sequence-wise distance. Here we developed a new approach utilizing tree-preserving embeddings for controlled sampling in sequence space developed by several different generative models. Without resorting to more complex models, the generated sequences conform to proteins with known attributes and retain high sequence similarity. This combined method requires no additional data sources and can deal with a wide range of different protein types, simplifying its use for generation of new proteins and other biomolecular sequences.
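
The abstract's point about clustering that follows a dendrogram of sequence-wise distances can be illustrated with a small sketch. It is not the report's method: the toy peptides, the use of Hamming distance, and the function names are all assumptions made for illustration. It computes the single-linkage cophenetic distance (the dendrogram height at which two sequences first join the same cluster), which is the kind of tree structure a tree-preserving embedding aims to keep.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_cophenetic(dist):
    """Return single-linkage cophenetic distances for a distance matrix.

    The single-linkage merge height of a pair equals the minimax path
    distance: the smallest achievable 'largest step' over all paths
    between them. A Floyd-Warshall style pass computes it directly.
    """
    n = len(dist)
    coph = [row[:] for row in dist]  # copy so the input is left unchanged
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(coph[i][k], coph[k][j])
                if via_k < coph[i][j]:
                    coph[i][j] = via_k
    return coph


# Hypothetical toy peptides (assumption): two visibly different families.
seqs = ["ACDEFG", "ACDEFH", "ACDQFH", "WYWYWY", "WYWYWF"]
dist = [[hamming(a, b) for b in seqs] for a in seqs]
coph = single_linkage_cophenetic(dist)

for name, row in zip(seqs, coph):
    print(name, row)
```

In this toy case the first and third peptides differ at two positions but join the dendrogram at height 1, because the second peptide links them. A tree-preserving embedding would be expected to respect those merge heights rather than the raw pairwise distances.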

Document Details

Document Type: Technical Report
Publication Date: May 03, 2023
Accession Number: AD1200590

Entities

People

Jerome A. Alvarez
Scott N Dean

Organizations

United States Naval Research Laboratory
