Population-Genetic Inference from Pooled-Sequencing Data

Abstract

Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies < 5/N (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency > 10/N. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
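
The two tiers of sampling mentioned in the abstract can be illustrated with a minimal sketch. It is not the report's estimator or test. It assumes an idealized model that the abstract does not spell out: diploid individuals contributing equally to the pool, binomial sampling of chromosomes from the population, binomial sampling of reads from the pool, and no sequencing error. The function names and the ploidy parameter are hypothetical. Under those assumptions, the variance of the read-based allele frequency is p(1 - p)[1/(2N) + (1 - 1/(2N))/C] for N individuals and coverage C, whereas an analysis that treats reads as direct draws from the population would use p(1 - p)/C.

```python
def two_tier_variance(p, n_individuals, coverage, ploidy=2):
    """Variance of the read-based allele-frequency estimate when both
    individuals and reads are sampled (idealized model, no sequencing error).

    Assumption: ploidy * n_individuals chromosomes are drawn binomially from
    the population, then `coverage` reads are drawn binomially from the pool.
    """
    n_chrom = ploidy * n_individuals
    # Tier 1: sampling individuals (chromosomes) from the population.
    individual_term = p * (1.0 - p) / n_chrom
    # Tier 2: sampling reads from the pool, averaged over pool composition.
    read_term = p * (1.0 - p) * (1.0 - 1.0 / n_chrom) / coverage
    return individual_term + read_term


def reads_only_variance(p, coverage):
    """Variance assumed by an analysis that ignores the sampling of individuals."""
    return p * (1.0 - p) / coverage


def variance_inflation(p, n_individuals, coverage, ploidy=2):
    """Ratio of the two-tier variance to the reads-only variance."""
    return two_tier_variance(p, n_individuals, coverage, ploidy) / reads_only_variance(p, coverage)


if __name__ == "__main__":
    # With the pool size held fixed, the understatement grows with coverage.
    for coverage in (10, 100, 1000):
        print(coverage, variance_inflation(0.2, 50, coverage))
```

In this simplified model the ratio reduces to 1 + (C - 1)/(2N), so for a fixed number of individuals the reads-only variance understates the true sampling variance more severely as coverage grows. That is consistent with the abstract's remark that single-tier analyses become increasingly misleading at higher depths.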

Document Details

Document Type
Technical Report
Publication Date
Apr 30, 2014
Accession Number
AD1067387

Entities

People

  • Darius Bost
  • Michael Lynch
  • Sade Wilson
  • Scott Harrison
  • Takahiro Maruki

Organizations

  • Indiana University Bloomington

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Biology
  • Data Processing
  • Demographic Cohorts
  • Detection
  • Drosophila
  • Equations
  • Errors
  • Estimators
  • Frequency
  • Genetic Structures
  • Genetics
  • Genome
  • Information Science
  • North Carolina
  • Nucleotides
  • Statistical Tests
  • Universities

Fields of Study

  • Biology
  • Mathematics

Readers

  • Molecular and genetic basis of cancer.
  • Regression Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • Biotechnology