Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins
Abstract
Accurate automatic annotation of protein function relies on both innovative models and robust datasets. Because DNA-binding proteins are important in biological processes, their identification directly from protein sequence has been the focus of many studies. However, the datasets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the datasets used in the previous DNA-binding protein literature and provide several new datasets that address these problems. We suggest new benchmark tasks that more realistically assess the real-world performance of protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved datasets with that of two previously published models. In addition, we provide extensive tests showing how well the best models predict across taxa.
Document Details
- Document Type: Pub Defense Publication
- Publication Date: Aug 20, 2021
- Source ID: 10.1093/bioinformatics/btab603
Entities
People
- Alexander Zaitzeff
- Francis C. Motta
- Jedediah M. Singer
- Nicholas Leiby
- Steven B. Haase
Organizations
- Air Force Research Laboratory
- Defense Advanced Research Projects Agency
- Duke University
- Florida Atlantic University
- United States Department of Defense