Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins

Abstract

Accurate automatic annotation of protein function relies on both innovative models and robust datasets. Because DNA-binding proteins are important in biological processes, identifying them directly from protein sequence has been the focus of many studies. However, the datasets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the datasets used in previous DNA-binding protein literature and provide several new datasets addressing these problems. We suggest new evaluative benchmark tasks that more realistically assess real-world performance for protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved datasets to two previously published models. In addition, we provide extensive tests showing how the best models predict across taxa.

Document Details

Document Type: Pub Defense Publication
Publication Date: Aug 20, 2021
Source ID: 10.1093/bioinformatics/btab603

Entities

People

Alexander Zaitzeff
Francis C. Motta
Jedediah M. Singer
Nicholas Leiby
Steven B. Haase

Organizations

Air Force Research Laboratory
Defense Advanced Research Projects Agency
Duke University
Florida Atlantic University
United States Department of Defense
