RefSeq Database Growth Influences the Accuracy of k-mer-Based Lowest Common Ancestor Species Identification

Abstract

To determine the role of the database in taxonomic sequence classification, we examine how changes to the database over time influence k-mer-based lowest common ancestor (LCA) taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
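
The effect described in the abstract can be illustrated with a toy sketch. It is not the report's method or any real tool's implementation: the taxonomy, the sequences, the k-mer length, and the majority-vote read assignment are all hypothetical simplifications chosen for illustration. Each k-mer is labeled with the lowest common ancestor of every genome that contains it, so adding a second species to a genus pushes shared k-mers up to the genus level, while adding a new genus lets previously unclassified reads be classified.

```python
from collections import Counter

K = 4  # toy k-mer length (assumption); real classifiers use much longer k-mers

# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "species_C": "genus_Y",
    "genus_X": "root",
    "genus_Y": "root",
    "root": None,
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"


def kmers(seq):
    """All overlapping substrings of length K."""
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]


def build_db(genomes):
    """Map each k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db


def classify(read, db):
    """Assign a read to the most frequent k-mer label.

    This majority vote is a simplification (assumption), not the
    scoring scheme of any particular classifier.
    """
    hits = Counter(db[km] for km in kmers(read) if km in db)
    if not hits:
        return "unclassified"
    return hits.most_common(1)[0][0]


# Hypothetical "older" and "newer" database versions.
old_genomes = {"species_A": "ACGTACGGTCA"}
new_genomes = {
    "species_A": "ACGTACGGTCA",
    "species_B": "ACGTACGGTTT",  # second species in the same genus
    "species_C": "TTTTGGGGCC",   # first species of a new genus
}
old_db = build_db(old_genomes)
new_db = build_db(new_genomes)

read_1 = "ACGTACGG"  # shared by species_A and species_B
read_2 = "TTTTGGGG"  # only found in species_C

for name, read in [("read_1", read_1), ("read_2", read_2)]:
    print(name, "old:", classify(read, old_db), "new:", classify(read, new_db))
```

In this toy setup, read_1 moves from a species-level call to a genus-level call once its k-mers are shared by two species, while read_2 goes from unclassified to classified, mirroring the two trends named in the abstract.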

Document Details

Document Type
Technical Report
Publication Date
Oct 30, 2018
Accession Number
AD1099688

Entities

People

  • Adam M. Phillippy
  • Daniel J. Nasko
  • Sergey Koren
  • Todd J. Treangen

Organizations

  • University of Maryland

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Bacteria
  • Biology
  • Classification
  • Computational Biology
  • Computer Science
  • Databases
  • Emerging Technology
  • Genome
  • Human Genome
  • Intelligence Community (United States)
  • Microbial Genome
  • Microbiomes
  • Microorganisms
  • Sequences
  • Staphylococcus Aureus
  • Taxonomy

Readers

  • Database Systems and Applications
  • Systems Analysis and Design
  • Vector-Borne Disease and Entomology

Technology Areas

  • AI & ML