BIGDATA: F: Statistical Approaches to Big Data Analytics
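Basic information
- Award number: 1633074
- Principal investigator:
- Amount: $500,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2016
- Funding country: United States
- Period: 2016-09-01 to 2021-08-31
- Project status: Completed
- Source:
- Keywords:
Project abstract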
The goals of this project include developing new Big Data analytical methods, providing an insightful understanding of their properties, and demonstrating major improvements over existing methods. While the driving application is cancer research, the lessons learned will be broadly applicable to a wide array of Big Data contexts. The major challenges addressed here include Data Integration, Data Heterogeneity and Parallelization. Data Integration is a recently understood need for combining widely differing types of measurements made on a common set of subjects. For example, in cancer research, common measurements in modern Big Data sets include gene expression, copy number, mutations, methylation and protein expression. Deep new statistical methods are proposed that focus on central scientific issues, such as how the various measurements interact with each other and, simultaneously, which aspects operate independently. Data Heterogeneity addresses a different issue which is also critical in cancer research. In this case, current efforts to boost sample sizes (essential to deeper scientific insights) involve multiple laboratories combining their data. A whole new conceptual model for understanding the bias-oriented challenges presented by this scenario, plus the foundations for the development of new analytical methods that are robust against such effects, will be developed here. Parallelization is the computational concept of doing large scale numerical analysis through the simultaneous use of multiple computer processors. The proposed research will provide new foundational understanding of several important issues in this area.
Data Integration will center on the Joint and Individual Variation Explained methodology. Early versions have already provided scientific insights not available from previous data analytic approaches. The basic idea will be first extended in the direction of more insightful groupings of data blocks, essential for understanding the full breadth of relationships between the available measurement types. The second extension will be in the direction of divergent groups of subjects, very important to the study of subtypes in cancer research and to the rest of precision medicine. In addition to new methodology, new methods of validation are proposed, and an asymptotic study of the properties will be conducted. The key new concept behind Data Heterogeneity is to replace the usual Gaussian conceptual model with a Gaussian mixture model, which makes intuitive sense but creates challenges, for example when using likelihood approaches, because mixture distributions do not form an exponential family. An even bigger challenge is that mere scale issues usually entail that full estimation of the distributional parameters is completely intractable. Yet many standard statistical methods can be negatively impacted by such structure in data, so the invention of a new class of statistical methods that are robust against this effect, without requiring full parameter estimation, is proposed. Validation and development of mathematical statistical insights will again be an important part of the research. Parallelization is an essential component of all modern computing environments. The proposed research takes a Fiducial Inference viewpoint, which gives new insights into how the needed numerical calculations can be farmed out to a variety of processors, and then the results combined into a useful analysis for complicated statistical tasks including hypothesis testing and construction of confidence intervals.
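The "farm out, then combine" pattern described in the abstract can be illustrated with a much simpler task than the one the project studies. The sketch below is not the project's Fiducial Inference method. It assumes a plain normal-approximation confidence interval for a mean, and the function names (`summarize`, `combine`, `parallel_mean_ci`) are hypothetical. Each worker reduces its chunk to a small summary, and the summaries are merged into the same answer a single pass over all the data would give.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from statistics import NormalDist


def summarize(chunk):
    """Per-worker step: count, mean, and sum of squared deviations (M2)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge two summaries with the standard pairwise mean/variance update."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def parallel_mean_ci(data, n_workers=4, level=0.95):
    """Split data into chunks, summarize each chunk, merge, and build a CI."""
    size = math.ceil(len(data) / n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Threads stand in for separate processors in this illustration.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        summaries = list(pool.map(summarize, chunks))
    total = summaries[0]
    for s in summaries[1:]:
        total = combine(total, s)
    n, mean, m2 = total
    se = math.sqrt(m2 / (n - 1) / n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean, (mean - z * se, mean + z * se)


if __name__ == "__main__":
    values = [float(i) for i in range(1, 101)]
    print(parallel_mean_ci(values, n_workers=4))
```

Only the three-number summaries travel between workers and the combiner, which is what makes the pattern attractive at scale. The harder question the abstract raises is how to do the same for tasks whose results do not merge this cleanly.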
Project outcomes
Journal articles (23)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
BFF: Bayesian, Fiducial, Frequentist Analysis of Age Effects in Daily Diary Data
- DOI: 10.1093/geronb/gbz100
- Year published: 2019
- Journal:
- Impact factor: 0
- Authors: Neupert, Shevaun D.; Hannig, Jan; Ram, ed., Nilam
- Corresponding author: Ram, ed., Nilam
Generalized fiducial inference for logistic graded response models
- DOI: 10.1007/s11336
- Year published: 2017
- Journal:
- Impact factor: 3
- Authors: Liu, Y.; Hannig, J.
- Corresponding author: Hannig, J.
Comments on “A Gibbs Sampler for a Class of Random Convex Polytopes”
- DOI: 10.1080/01621459.2021.1950002
- Year published: 2021
- Journal:
- Impact factor: 3.7
- Authors: Hoffman, Kentaro; Hannig, Jan; Zhang, Kai
- Corresponding author: Zhang, Kai
A note on optimal sampling strategy for structural variant detection using optical mapping
- DOI: 10.1080/03610926.2020.1723638
- Year published: 2021
- Journal:
- Impact factor: 0
- Authors: Li, Weiwei; Hannig, Jan; Jones, Corbin D.
- Corresponding author: Jones, Corbin D.
Deep fiducial inference
- DOI: 10.1002/sta4.308
- Year published: 2020
- Journal:
- Impact factor: 1.7
- Authors: Li, Gang; Hannig, Jan
- Corresponding author: Hannig, Jan
Other grants by James Marron
Data Integration Via Analysis of Subspaces (DIVAS)
- Award number: 2113404
- Fiscal year: 2021
- Funding amount: $500,000
- Project type: Standard Grant
Collaborative Research: Tree Structured Object Oriented Data Analysis
- Award number: 0854908
- Fiscal year: 2009
- Funding amount: $500,000
- Project type: Standard Grant
Collaborative Research: Statistical Learning and Object Oriented Data Analysis
- Award number: 0606577
- Fiscal year: 2006
- Funding amount: $500,000
- Project type: Standard Grant
High Dimension - Low Sample Size Statistical Analysis
- Award number: 0308331
- Fiscal year: 2003
- Funding amount: $500,000
- Project type: Continuing Grant
Populations of Complex Objects, Visualization and Smoothing
- Award number: 9971649
- Fiscal year: 1999
- Funding amount: $500,000
- Project type: Continuing Grant
Mathematical Sciences: Nonparametric Curve Estimation
- Award number: 9203135
- Fiscal year: 1992
- Funding amount: $500,000
- Project type: Continuing Grant
U.S.-Belgium Cooperative Research: Bandwidth Selection and Construction of Confidence Bands in Nonparametric Regression
- Award number: 9107498
- Fiscal year: 1991
- Funding amount: $500,000
- Project type: Standard Grant
Similar overseas grants
Statistical physics and network-based approaches for elucidating molecular biomarkers of COPD
- Award number: 10559835
- Fiscal year: 2023
- Funding amount: $500,000
- Project type:
New statistical approaches to mapping the functional impact of HLA alleles in multimodal complex disease datasets
- Award number: 2748611
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Studentship
CAREER: New Statistical Approaches for Studying Evolutionary Processes: Inference, Attribution and Computation
- Award number: 2143242
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Continuing Grant
Statistical approaches to improving functional connectivity estimates with an application to autism
- Award number: 10598631
- Fiscal year: 2022
- Funding amount: $500,000
- Project type:
Innovative development of statistical and machine learning approaches for financial and actuarial risk measurement
- Award number: 22H00834
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Grant-in-Aid for Scientific Research (B)
Improving Statistical Machine Learning approaches for Time-to-Event Prediction Modelling
- Award number: 2722161
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Studentship
Statistical methods for analyzing messy microbiome data: detection of hidden artifacts and robust modeling approaches
- Award number: 10708908
- Fiscal year: 2022
- Funding amount: $500,000
- Project type:
Applications of Stochastic Machine Learning and Statistical Signal Processing Approaches to Automatic Music Transcription and Visualisation
- Award number: 2738835
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Studentship
New Bayesian statistical mechanical approaches to integrative structural biology using unassigned NMR and mass spectrometry
- Award number: RGPIN-2022-03287
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Discovery Grants Program - Individual
Hedging of longevity/mortality risks and statistical approaches to modelling mortality rates
- Award number: RGPIN-2019-06782
- Fiscal year: 2022
- Funding amount: $500,000
- Project type: Discovery Grants Program - Individual