Scalable Microbial Strain Inference in Metagenomic Data Using StrainFacts

Abstract

While genome databases are nearing a complete catalog of species commonly inhabiting the human gut, their representation of intraspecific diversity is lacking for all but the most abundant and frequently studied taxa. Statistical deconvolution of allele frequencies from shotgun metagenomic data into strain genotypes and relative abundances is a promising approach, but existing methods are limited by computational scalability. Here we introduce StrainFacts, a method for strain deconvolution that enables inference across tens of thousands of metagenomes. Unlike existing methods, we harness a “fuzzy” genotype approximation that makes the underlying graphical model fully differentiable. This allows parameter estimates to be optimized with gradient-based methods, speeding up model fitting by two orders of magnitude. A GPU implementation provides additional scalability. Extensive simulations show that StrainFacts can perform strain inference on thousands of metagenomes and achieves accuracy comparable to that of more computationally intensive tools. We further validate our strain inferences using single-cell genomic sequencing from a human stool sample. Applying StrainFacts to a collection of more than 10,000 publicly available human stool metagenomes, we quantify patterns of strain diversity, biogeography, and linkage disequilibrium that agree with and expand on what is known based on existing reference genomes. StrainFacts paves the way for large-scale biogeography and population genetic studies of microbiomes using metagenomic data.
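
The “fuzzy” genotype idea can be illustrated with a toy example. The sketch below is not the StrainFacts implementation or its statistical model. It assumes a plain squared-error loss, noiseless allele frequencies, and hand-written gradients, and every name in it (`fit_fuzzy_genotypes`, `freqs`, `n_strains`) is hypothetical. It only shows why relaxing discrete genotypes (0 or 1) to continuous values in [0, 1] lets both genotypes and strain abundances be fit by ordinary gradient descent.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def fit_fuzzy_genotypes(freqs, n_strains, steps=500, lr=1.0, seed=0):
    """Fit freqs[i][g] ~= sum_s abundance[i][s] * genotype[s][g].

    Assumption: squared-error loss on allele frequencies (a simplification
    chosen for this sketch). Genotypes are "fuzzy": sigmoid outputs in (0, 1)
    rather than discrete alleles. Abundances are softmax outputs, so each
    sample's strain fractions sum to 1.
    """
    rng = random.Random(seed)
    n_samples, n_sites = len(freqs), len(freqs[0])
    # Unconstrained parameters; small random values break symmetry.
    a = [[rng.gauss(0, 0.1) for _ in range(n_strains)] for _ in range(n_samples)]
    b = [[rng.gauss(0, 0.1) for _ in range(n_sites)] for _ in range(n_strains)]
    scale = 2.0 / (n_samples * n_sites)
    losses = []
    for _ in range(steps):
        pi = [softmax(r) for r in a]
        gamma = [[sigmoid(v) for v in r] for r in b]
        # Residual between predicted and observed allele frequency.
        resid = [
            [sum(pi[i][s] * gamma[s][g] for s in range(n_strains)) - freqs[i][g]
             for g in range(n_sites)]
            for i in range(n_samples)
        ]
        losses.append(sum(r * r for row in resid for r in row) / (n_samples * n_sites))
        # Gradient w.r.t. genotype logits (chain rule through the sigmoid).
        grad_b = [
            [scale * sum(resid[i][g] * pi[i][s] for i in range(n_samples))
             * gamma[s][g] * (1.0 - gamma[s][g])
             for g in range(n_sites)]
            for s in range(n_strains)
        ]
        # Gradient w.r.t. abundance logits (chain rule through the softmax).
        grad_a = []
        for i in range(n_samples):
            d = [scale * sum(resid[i][g] * gamma[s][g] for g in range(n_sites))
                 for s in range(n_strains)]
            mean_d = sum(pi[i][t] * d[t] for t in range(n_strains))
            grad_a.append([pi[i][s] * (d[s] - mean_d) for s in range(n_strains)])
        # Plain gradient-descent update.
        for i in range(n_samples):
            for s in range(n_strains):
                a[i][s] -= lr * grad_a[i][s]
        for s in range(n_strains):
            for g in range(n_sites):
                b[s][g] -= lr * grad_b[s][g]
    abundance = [softmax(r) for r in a]
    genotype = [[sigmoid(v) for v in r] for r in b]
    return abundance, genotype, losses


# Toy data: 3 samples, 4 sites, mixed from 2 made-up strains.
true_genotype = [[0, 1, 0, 1], [1, 1, 0, 0]]
true_abundance = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
freqs = [
    [sum(p[s] * true_genotype[s][g] for s in range(2)) for g in range(4)]
    for p in true_abundance
]
abundance, genotype, losses = fit_fuzzy_genotypes(freqs, n_strains=2)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because every operation is differentiable, the same structure maps directly onto an automatic-differentiation framework and a GPU, which is the property the abstract credits for the speedup. The recovered strains may come out in a different order than the ones used to generate the data, since the factorization is only identifiable up to relabeling.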

Document Details

Document Type
Pub Defense Publication
Publication Date
May 16, 2022
Source ID
10.3389/fbinf.2022.867386

Entities

People

  • Adam R. Abate
  • Byron J. Smith
  • Katherine Pollard
  • Xiangpeng Li
  • Zhou Jason Shi

Organizations

  • National Institutes of Health
  • National Science Foundation
  • Office of the Director of National Intelligence

Tags

Fields of Study

  • Biology

Readers

  • Computational Modeling and Simulation
  • Gulf War Illness and Chronic Multisymptom Illness in Veterans
  • Molecular Genetics

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms
  • Biotechnology