Statistical and Computational Tradeoffs in Massive Data

Abstract

1 Proposal Summary

With the rapid advancement of science and technology, data are being accumulated at unprecedented speed. Examples include the medical records of all patients of a large health care provider over time, world climate data, and wireless sensor network data. When data are too large to be stored, processed, and analyzed by traditional means in reasonable time, we enter the era of so-called “Big Data.” The massive sample size introduces unique computational and statistical challenges.

Consider a recent news item on big data. On August 6, 2014, Nature reported: “US Big-Data Health Network Launches Aspirin Study.” This $10-million pilot study will investigate the use of aspirin to prevent heart disease. Specifically, participants will take daily doses of aspirin that fall within the range typically prescribed for heart disease and will be monitored to determine whether one dosage works better than the others. Health-care data such as insurance claims, blood tests, and medical histories will be collected from as many as 30 million people in the United States through PCORnet. PCORnet will connect multiple smaller networks, giving researchers access to records at a large number of institutions without creating a central data repository. This decentralization creates one of the greatest challenges: how to merge and standardize data from different networks to enable accurate comparison. The many types of data (scans from medical imaging, vital-signs records and, eventually, genetic information) can be messy, and record-keeping systems vary among health-care institutions.
Motivated by this US health network data, we summarize the features of big data as the “4D”:

  • Distributed: computation and storage bottlenecks;
  • Dirty: the curse of heterogeneity, e.g., unstructured data;
  • Dimensionality: growing along with a large sample size;
  • Dynamic: a varying and unknown underlying distribution, e.g., temporal data.

All these features, which are often mixed together in reality, make it very challenging to apply traditional statistical thinking to massive data. For example, how should a limited computational budget be allocated to conduct the best possible statistical analysis in a parallel computing environment? Another example is how to efficiently extract features common to many subpopulations in massive heterogeneous data while exploring the heterogeneity of each subpopulation, even as the number of subpopulations grows. These tasks become more formidable when the underlying distribution is unknown and varies as data accumulate. Addressing these questions in various scenarios is the main purpose of our proposal, which is also closely related to Mathematical Data Science’s mission to promote big data research. The consideration of computation in statistical analysis forms the core of our proposal, which consists of three projects addressing different challenges of massive data in the semi/nonparametric regression setup.

Document Details

Document Type
DoD Grant Award
Publication Date
Aug 12, 2016
Source ID
N000141512331

Entities

People

  • Guang Cheng

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Virginia

Tags

Readers

  • Distributed Systems and Data Platform Development
  • Medical or Health Care Field
  • Regression Analysis

Technology Areas

  • AI & ML
  • Biotechnology