Collaborative Research: Efficient Parallel Iterative Monte Carlo Methods for Statistical Analysis of Big Data

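Basic information

  • Award number:
    1316922
  • Principal investigator:
  • Amount:
    $81,600
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Project period:
    2013-08-01 to 2016-07-31
  • Project status:
    Completed

Project abstract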

The integration of computer technology into science and daily life has enabled the collection of massive volumes of data. Analyzing these data may require parallel and distributed architectures. While such architectures offer new capabilities for storing and manipulating big data, it is unclear, from an inferential point of view, how current statistical methodology can be transported to the big data paradigm. Growing data size also typically comes with growing complexity in the data structures and in the models needed to account for them. Iterative Monte Carlo algorithms, such as Markov chain Monte Carlo (MCMC), stochastic approximation, and expectation-maximization (EM) algorithms, have proven to be very powerful, and typically unique, computational tools for analyzing data with complex structures. They are nevertheless infeasible for big data, because they typically require a large number of iterations and a complete scan of the full dataset in each iteration. Big data thus pose a great challenge to current statistical methodology.
The investigators propose a general principle for developing Monte Carlo algorithms that are feasible for big data and workable on parallel and distributed architectures: use Monte Carlo averages calculated in parallel from subsamples to approximate the quantities that would otherwise have to be calculated from the full dataset. This principle avoids repeated scans of the full data across algorithm iterations while still enabling the algorithm to produce statistically sensible solutions to the problem under consideration. Under this principle, a general algorithm, the subsampling approximation-based parallel stochastic approximation algorithm, is proposed for parameter estimation in big data problems. Unlike existing algorithms, such as the bag of little bootstraps, aggregated estimation equation, and split-and-conquer algorithms, the proposed algorithm works for problems in which the observations are generally dependent. Under the same principle, a subsampling approximation-based parallel Metropolis-Hastings algorithm is proposed for Bayesian analysis of big data, and a subsampling approximation-based parallel Monte Carlo EM algorithm is proposed for parameter estimation in big data problems with missing observations. In addition to these subsampling approximation-based parallel iterative Monte Carlo algorithms, an embarrassingly parallel MCMC algorithm, based on the popular idea of divide-and-conquer, is proposed for Bayesian analysis of big data. Various schemes for dataset partitioning and result aggregation are proposed. The validity of the proposed parallel iterative Monte Carlo algorithms, both the subsampling approximation-based and the embarrassingly parallel ones, will be rigorously studied. The proposed algorithms will be applied to spatio-temporal modeling of satellite climate data, genome-wide association studies, and stream data analysis.
The intellectual merit of this project lies in proposing a general principle for statistical analysis of big data: using Monte Carlo averages of subsamples to approximate the quantities that would otherwise have to be calculated from the full dataset. This principle provides a general strategy for transporting current statistical methodology to the big data paradigm. Under this principle, several subsampling approximation-based parallel iterative Monte Carlo algorithms are proposed. The proposed algorithms address the core problem of big data analysis: how to make a statistically sensible analysis of big data while avoiding repeated scans of the full dataset. This project will have broader impacts because big data are ubiquitous in almost all fields of science and technology. A successful research program in the theory and methods of parallel iterative Monte Carlo computation can bring immense benefit throughout science and technology. The research results will be disseminated to the communities of interest, such as atmospheric science, biomedical science, engineering, and social science, via direct collaboration with researchers in these disciplines, conference presentations, books, and papers published in academic journals. The project will also have a significant impact on education through the direct involvement of graduate students and the incorporation of results into undergraduate and graduate courses. In addition, the package Distributed Iterative Statistical Computing (DISC), to be developed under this project, is designed to give Ph.D. students and researchers who, like the investigators, have network-connected computers a platform for experimenting with new ideas for developing efficient iterative Monte Carlo algorithms in parallel or, more exactly, grid computing environments.
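As a rough illustration of the principle stated in the abstract (replacing a full-data quantity with a Monte Carlo average of subsample quantities computed in parallel), the following minimal sketch runs a Robbins-Monro stochastic approximation for the mean of a Gaussian model. It is not the investigators' algorithm: the function names, the 1/t gain, the i.i.d. N(theta, 1) toy model, and the use of a thread pool as a stand-in for distributed workers are all assumptions made for this example. In particular, the abstract targets dependent observations, which this toy does not model.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def subsample_score(data, theta, m, seed):
    """Average score of a N(theta, 1) model over one random subsample of size m.

    Hypothetical helper: the score of one observation x is (x - theta).
    """
    rng = random.Random(seed)
    sub = rng.sample(data, m)
    return sum(x - theta for x in sub) / m


def parallel_sa_mean(data, n_iter=200, n_workers=4, m=100, seed=0):
    """Stochastic approximation driven by subsample averages, not full-data scans."""
    rng = random.Random(seed)
    theta = 0.0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for t in range(1, n_iter + 1):
            seeds = [rng.randrange(2**32) for _ in range(n_workers)]
            # Each worker reads only m observations in this iteration.
            scores = list(
                pool.map(lambda s: subsample_score(data, theta, m, s), seeds)
            )
            # Monte Carlo average across workers approximates the full-data score.
            avg_score = sum(scores) / n_workers
            theta += avg_score / t  # decreasing gain 1/t (an assumed choice)
    return theta


if __name__ == "__main__":
    gen = random.Random(42)
    data = [gen.gauss(3.0, 1.0) for _ in range(100_000)]
    print("estimate:", parallel_sa_mean(data))
    print("full-data mean:", sum(data) / len(data))
```

Each iteration reads only `n_workers * m` observations, so the per-iteration cost does not grow with the size of the dataset. Threads are used here only to keep the sketch self-contained; a real deployment would place the subsamples on separate machines or processes.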

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Chuanhai Liu

Alternating Subspace-Spanning Resampling to Accelerate Markov Chain Monte Carlo Simulation
  • DOI:
    10.1198/016214503388619148
  • Published:
    2003
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chuanhai Liu
  • Corresponding author:
    Chuanhai Liu
Not Asked and Not Answered: Multiple Imputation for Multiple Surveys: Rejoinder
  • DOI:
  • Published:
    1998
  • Journal:
  • Impact factor:
    0
  • Authors:
    A. Gelman;Gary King;Chuanhai Liu
  • Corresponding author:
    Chuanhai Liu
Reweighted Anderson-Darling Tests of Goodness-of-Fit
  • DOI:
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chuanhai Liu
  • Corresponding author:
    Chuanhai Liu
Parameter Expansion and Efficient Inference
  • DOI:
  • Published:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Andrew Lewandowski;Chuanhai Liu;S. V. Wiel
  • Corresponding author:
    S. V. Wiel
Settle the unsettling: an inferential models perspective
  • DOI:
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chuanhai Liu;Ryan Martin
  • Corresponding author:
    Ryan Martin

Other grants held by Chuanhai Liu

Collaborative Research: Prior-free probabilistic inferential methods for "large-p-small-n" linear regression problems
  • Award number:
    1208841
  • Fiscal year:
    2012
  • Funding amount:
    $81,600
  • Project type:
    Continuing Grant
Large-Scale Multinomial Inference and Its Applications in Genome-Wide Association Studies
  • Award number:
    1007678
  • Fiscal year:
    2010
  • Funding amount:
    $81,600
  • Project type:
    Continuing Grant

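Similar NSFC (National Natural Science Foundation of China) grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Funding amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Funding amount:
    CNY 450,000
  • Project type:
    General Program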

Similar overseas grants

Collaborative Research: Beyond the Single-Atom Paradigm: A Priori Design of Dual-Atom Alloy Active Sites for Efficient and Selective Chemical Conversions
  • Award number:
    2334970
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
  • Award number:
    2412357
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: Reversible Computing and Reservoir Computing with Magnetic Skyrmions for Energy-Efficient Boolean Logic and Artificial Intelligence Hardware
  • Award number:
    2343606
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: Beyond the Single-Atom Paradigm: A Priori Design of Dual-Atom Alloy Active Sites for Efficient and Selective Chemical Conversions
  • Award number:
    2334969
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: Integrated Materials-Manufacturing-Controls Framework for Efficient and Resilient Manufacturing Systems
  • Award number:
    2346650
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: Integrated Materials-Manufacturing-Controls Framework for Efficient and Resilient Manufacturing Systems
  • Award number:
    2346651
  • Fiscal year:
    2024
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: FET: Medium:Compact and Energy-Efficient Compute-in-Memory Accelerator for Deep Learning Leveraging Ferroelectric Vertical NAND Memory
  • Award number:
    2312886
  • Fiscal year:
    2023
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: FET: Medium:Compact and Energy-Efficient Compute-in-Memory Accelerator for Deep Learning Leveraging Ferroelectric Vertical NAND Memory
  • Award number:
    2312884
  • Fiscal year:
    2023
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: FET: Medium: Efficient Compilation for Dynamically Reconfigurable Atom Arrays
  • Award number:
    2313084
  • Fiscal year:
    2023
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Quasi Weightless Neural Networks for Energy-Efficient Machine Learning on the Edge
  • Award number:
    2326895
  • Fiscal year:
    2023
  • Funding amount:
    $81,600
  • Project type:
    Standard Grant