Modeling, Inference, and Optimization for Genomic and Biomedical Big Data

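Basic Information

Grant number:
10633126
Principal investigator:
Kenneth L Lange
Amount:
$539,200
Host institution:
UNIVERSITY OF CALIFORNIA LOS ANGELES
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-07-01 to 2026-05-31
Project status:
Ongoing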

Project Abstract

The biomedical sciences are drowning in big data. Progress in fields such as genomics and medical imaging is being stymied by the lack of appropriate computational tools. This grant promotes the development of algorithms, statistical methods, and software for the analysis of the big datasets encountered in the biomedical sciences. The NIH All of Us Program, the Million Veteran Project (MVP) sponsored by the US Department of Veterans Affairs (VA), and the UK Biobank are three prime examples of recent massive datasets. These datasets require terabytes of storage, with sample sizes ranging from 10^5 to 10^6 subjects and above. The datasets are also dynamic, growing over time in size and complexity. In addition, the datasets are heterogeneous; for example, the UK Biobank offers genomic data, electronic health record (EHR) data, and imaging data on the same study individuals. Finally, as with most real-world data, the data are fraught with missingness and inaccuracy. We propose attacking the issues of parameter estimation and model selection raised by such massive datasets. We will be guided by principles of parsimony and high-dimensional optimization. Most of the specific applications we have in mind involve imaging and genomics, particularly genomewide association discovery. Fortunately, most of the tools and software we construct will be more generically useful. Our successful algorithms will be coded in the modern scientific programming language Julia and posted on publicly available websites. We will focus on constrained and sparse regression, EM and MM algorithms for optimization, variance components models, bootstrapping of linear mixed models, a copula-like model for correlated data, and sensitivity analysis in epidemic models. These are all subjects of paramount importance in modern genomics, biostatistics, and data mining.

Project Outputs

Journal articles (9)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Orthogonal Trace-Sum Maximization: Tightness of the Semidefinite Relaxation and Guarantee of Locally Optimal Solutions.

DOI:
10.1137/21m1422707
Published:
2022
Journal:
SIAM Journal on Optimization: a publication of the Society for Industrial and Applied Mathematics
Impact factor:
0
Authors:
Won, Joong-Ho; Zhang, Teng; Zhou, Hua
Corresponding author:
Zhou, Hua

MM optimization: Proximal distance algorithms, path following, and trust regions.

DOI:
10.1073/pnas.2303168120
Published:
2023-07-04
Journal:
Proceedings of the National Academy of Sciences of the United States of America
Impact factor:
11.1
Authors:
Landeros, Alfonso; Xu, Jason; Lange, Kenneth
Corresponding author:
Lange, Kenneth

Bayesian Trend Filtering via Proximal Markov Chain Monte Carlo