SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning

Abstract

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, and is available for download at www.gitlab.com/treangenlab/seqscreen.

Document Details

Document Type
Pub Defense Publication
Publication Date
Jun 20, 2022
Source ID
10.1186/s13059-022-02695-x

Entities

People

  • Advait Balaji
  • Anthony D. Kappell
  • Bryce Kille
  • Daniel J. Nasko
  • Dreycey Albin
  • Gene D. Godbold
  • Krista L. Ternus
  • Madeline Diep
  • Mihai Pop
  • Nidhi Shah
  • R. A. Leo Elworth
  • Santiago Segarra
  • Todd J. Treangen
  • Zhiqin Qian

Organizations

  • Division of Computer and Network Systems
  • Division of Intramural Research, National Institute of Allergy and Infectious Diseases
  • Intelligence Advanced Research Projects Activity
  • National Science Foundation Directorate for Biological Sciences
  • United States National Library of Medicine

Tags

Fields of Study

  • Biology

Readers

  • Distributed Systems and Data Platform Development
  • Infectious Disease/Epidemiology
  • Molecular Genetics

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Biotechnology