To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

Abstract

As computational biologists continue to be inundated by ever-increasing amounts of metagenomic data, developing data analysis approaches that keep pace with the growth of sequence archives remains a challenge. In recent years, the accelerating availability of genomic data has been accompanied by the application of a wide array of highly efficient approaches from other fields to metagenomics. For instance, sketching algorithms such as MinHash have seen rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances, augmenting previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
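
To make the MinHash idea mentioned in the abstract concrete, the following is a minimal illustrative sketch, not the implementation of any tool discussed in the paper. It estimates the Jaccard similarity between the k-mer sets of two sequences from fixed-size signatures. The function names, the k-mer length, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Seeded 64-bit hash of a k-mer (assumption: BLAKE2b with the seed as salt)."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=4, num_hashes=128):
    """Keep the minimum hash value of the k-mer set under each seeded hash."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(hash64(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(seq_a, seq_b, k=4):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    b = "ACGTACGTTAGCCGATAGGCTTAACGGTTCCA"  # differs from a at two positions
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(a, b))
```

Each signature has a fixed length regardless of sequence length, which is what lets sketch comparisons scale to large collections. The estimate's standard error shrinks roughly as one over the square root of the number of hash functions, so a larger `num_hashes` trades memory for accuracy.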

Document Details

Document Type
Pub Defense Publication
Publication Date
Apr 27, 2020
Source ID
10.1093/nar/gkaa265

Entities

People

  • Advait Balaji
  • Anshumali Shrivastava
  • Benjamin Coleman
  • C. J. Barberan
  • Gaurav Gupta
  • Pavan K. Kota
  • Qi Wang
  • R A Leo Elworth
  • Richard G. Baraniuk
  • Todd J Treangen

Organizations

  • Air Force Office of Scientific Research
  • Army Research Office
  • Defense Advanced Research Projects Agency
  • Department of Computer Science, University of Oxford
  • Intelligence Advanced Research Projects Activity
  • National Institute of Neurological Disorders and Stroke
  • National Institutes of Health
  • National Science Foundation
  • Office of Naval Research
  • Office of the Director of National Intelligence
  • Rice University
  • United States National Library of Medicine

Tags

Fields of Study

  • Biology

Readers

  • Distributed Systems and Data Platform Development
  • Economics