
Modeling, Inference, and Optimization for Genomic and Biomedical Big Data

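Basic Information

Grant number:
10438722
Principal investigator:
Kenneth L Lange
Award amount:
$539,200
Host institution:
UNIVERSITY OF CALIFORNIA LOS ANGELES
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-07-01 to 2026-05-31
Project status:
Ongoing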

Project Abstract

The biomedical sciences are drowning in big data. Progress in fields such as genomics and medical imaging is being stymied by the lack of appropriate computational tools. This grant promotes the development of algorithms, statistical methods, and software for the analysis of the big datasets encountered in the biomedical sciences. The NIH All of Us Program, the Million Veteran Project (MVP) sponsored by the US Department of Veterans Affairs (VA), and the UK Biobank are three prime examples of recent massive datasets. These datasets require terabytes of storage, with sample sizes ranging from 10^5 to 10^6 or more subjects. The datasets are also dynamic, growing over time in size and complexity. In addition, the datasets are heterogeneous; for example, the UK Biobank offers genomic data, electronic health record (EHR) data, and imaging data on the same study individuals. Finally, as with most real-world data, the data are fraught with missingness and inaccuracy.

We propose attacking the issues of parameter estimation and model selection raised by such massive datasets. We will be guided by principles of parsimony and high-dimensional optimization. Most of the specific applications we have in mind involve imaging and genomics, particularly genomewide association discovery. Fortunately, most of the tools and software we construct will be more generically useful. Our successful algorithms will be coded in the modern scientific programming language Julia and posted on publicly available websites. We will focus on constrained and sparse regression, EM and MM algorithms for optimization, variance components models, bootstrapping of linear mixed models, a copula-like model for correlated data, and sensitivity analysis in epidemic models. These are all subjects of paramount importance in modern genomics, biostatistics, and data mining.
