Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings

Abstract

Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.

Document Details

Document Type: Pub Defense Publication
Publication Date: Jan 01, 2023
Source ID: 10.1093/bib/bbac599

Entities

People

Natarajan Kannan
Sheng Li
Wayland Yeung
Zhongliang Zhou

Organizations

University of Georgia
University of Virginia
