Controlled Generation of Protein Sequences from Tree-Preserving Embeddings and Generative Deep Learning Models

Abstract

Rapid development of novel biomolecular sequences, particularly proteins, is essential to fields ranging from drug development to food safety, and recent advances in generative deep learning for sequence generation have demonstrated considerable potential value for these applications. However, progressing from simple exploratory data analysis to controlled generation of novel sequences that resemble those found in nature remains cumbersome. Dimensionality reduction, a staple of exploratory data analysis, embeds high-dimensional data in a low-dimensional space while preserving some of the structure in the data, and can thereby enable useful clustering. Although methods such as principal component analysis, multidimensional scaling, and stochastic neighbor embedding are popular in protein sequence analysis, their direct application to sequence datasets generally fails to separate clusters, and the clusters that do form do not align with known attributes of the proteins being analyzed. In contrast, tree-preserving embeddings demonstrate remarkable performance, producing logical clusters that follow from dendrograms of sequence-wise distance. Here we developed a new approach that uses tree-preserving embeddings for controlled sampling in the sequence space learned by several different generative models. Without resorting to more complex models, the generated sequences conform to proteins with known attributes and show high sequence similarity. This combined method requires no additional data sources and can handle a wide range of protein types, simplifying its use for generating new proteins and other biomolecular sequences.
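
The dendrogram of sequence-wise distance mentioned in the abstract can be illustrated with a minimal sketch. This is not the report's method and does not compute a tree-preserving embedding; it only shows the hierarchical grouping step that such an embedding would aim to respect. The toy sequences, the Hamming-fraction distance, the choice of single linkage, and the function names `hamming_fraction` and `single_linkage` are all assumptions made for illustration.

```python
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage(seqs: list[str], n_clusters: int) -> list[set[int]]:
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [{i} for i in range(len(seqs))]
    # Pairwise distances, keyed by (smaller index, larger index).
    dist = {
        (i, j): hamming_fraction(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def gap(c1: set[int], c2: set[int]) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: gap(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] |= clusters[b]  # merge cluster b into cluster a
        del clusters[b]             # a < b, so index a stays valid
    return clusters


# Hypothetical, pre-aligned toy sequences (not taken from the report).
toy_sequences = [
    "MKTAYIAK",
    "MKTAYIAR",
    "MKSAYIAK",
    "GGLPWWDE",
    "GGLPWWDQ",
]
groups = single_linkage(toy_sequences, n_clusters=2)
print(groups)  # [{0, 1, 2}, {3, 4}]
```

Cutting the merge process at two clusters separates the two toy families. Real protein datasets would need an alignment-aware or alignment-free distance in place of the Hamming fraction used here.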

Document Details

Document Type
Technical Report
Publication Date
May 03, 2023
Accession Number
AD1200590

Entities

People

  • Jerome A. Alvarez
  • Scott N. Dean

Organizations

  • United States Naval Research Laboratory

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Anti-Infective Agents
  • Artificial Intelligence Software
  • Computational Science
  • Data Analysis
  • Data Mining
  • Databases
  • Deep Learning
  • Dimensionality Reduction
  • Engineering
  • Generative Models
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Neural Networks
  • Recurrent Neural Networks
  • Systems Biology
  • Three Dimensional
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Molecular Genetics
  • Neural Network Machine Learning
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Space