RefSeq Database Growth Influences the Accuracy of k-mer-Based Lowest Common Ancestor Species Identification
Abstract
To determine the role of the database in taxonomic sequence classification, we examine how changes to the database over time influence k-mer-based lowest common ancestor (LCA) taxonomic classification. We present three major findings: (1) the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; (2) as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and (3) Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
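The mechanism behind the second finding can be illustrated with a toy example. The following is a minimal sketch, not the report's pipeline or any real classifier's implementation: the taxonomy, the sequences, the k-mer length, and all function names are hypothetical, and a simple majority vote over k-mer hits stands in for the scoring used by real tools. Each k-mer is mapped to the lowest common ancestor of every reference genome that contains it, so adding a second species from the same genus pushes shared k-mers up to the genus level.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); "root" is its own parent.
PARENT = {
    "root": "root",
    "genus_A": "root",
    "species_A1": "genus_A",
    "species_A2": "genus_A",
}

def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(genomes, k):
    """Map each k-mer to the LCA of all reference taxa that contain it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index

def classify(read, index, k):
    """Assign a read to the taxon hit by most of its k-mers.

    Simplifying assumption: majority vote. Returns None if no k-mer matches.
    """
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    return hits.most_common(1)[0][0] if hits else None

K = 4  # illustrative k-mer length only; real tools use much longer k-mers

# "Older" database: one species. "Newer" database: a second species
# from the same genus that shares part of its sequence with the first.
old_db = build_index({"species_A1": "ACGTACGGTCA"}, K)
new_db = build_index(
    {"species_A1": "ACGTACGGTCA", "species_A2": "ACGTACGGTTT"}, K
)

shared_read = "ACGTACGG"  # region present in both species
novel_read = "GGTTT"      # region present only in species_A2

for name, db in (("old", old_db), ("new", new_db)):
    print(name, classify(shared_read, db, K), classify(novel_read, db, K))
```

In this toy setting the newer database classifies a read that the older one left unclassified, while the read from the shared region moves from a species-level call to a genus-level call, mirroring the trend described in the abstract.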
Document Details
- Document Type: Technical Report
- Publication Date: Oct 30, 2018
- Accession Number: AD1099688
Entities
People
- Adam M. Phillippy
- Daniel J. Nasko
- Sergey Koren
- Todd J. Treangen
Organizations
- University of Maryland