
Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data

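Basic Information

Project number:
9923688
Principal investigator:
Yingying Fan
Award amount:
approx. $276,700
Host institution:
UNIVERSITY OF SOUTHERN CALIFORNIA
Institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-08-01 to 2022-04-30
Project status:
Completed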

Source:
https://reporter.nih.gov/project-details/9923688
Keywords:
Address; Algorithms; Archaea; Attention; Bacteria; Big Data; Biological; Bypass; Cells; Colorectal Cancer; Complex; Computer software; Consult; Coupled; Data; Data Set; Development; Dimensions; Disease; Ecosystem; Effectiveness; Environment; Foundations; Frequencies; Gaussian model; Genes; Genetic Materials; Genomics; Healthcare; Human; Internet; Investigation; Joints; Length; Linear Regressions; Literature; Liver Cirrhosis; Mathematics; Metagenomics; Methods; Modeling; Modernization; Molecular; Molecular Sequence Data; Mutation; Neurosciences; Non-Insulin-Dependent Diabetes Mellitus; Non-linear Models; Obesity; Organism; Performance; Planet Earth; Play; Procedures; Reproducibility; Reproducibility of Results; Research; Research Personnel; Role; Sampling; Sampling Studies; Shotguns; Social Sciences; Testing; Theoretical Studies; Tissues; Training; Viral; Virus; Visualization software; Work; base; biological research; computerized tools; contig; dark matter; deep learning; deep learning algorithm; design; flexibility; high dimensionality; human disease; human tissue; improved; interest; learning strategy; machine learning method; metagenomic sequencing; microbial community; microbiome; microbiome research; model design; model development; new technology; novel; power analysis; response; simulation; statistical and machine learning; theories; trait; user-friendly; virus host interaction; virus identification

Project Abstract

Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications, such as the recent National Microbiome Initiative (NMI), demand highly flexible statistical machine learning methods that produce both interpretable and reproducible results. It is therefore of paramount importance to identify, from a large number of available covariates, the crucial causal factors responsible for the response. This task can be formulated statistically as false discovery rate (FDR) control in general high-dimensional nonlinear models.

Despite the many applications of shotgun metagenomic studies, most existing investigations concentrate on bacterial organisms. However, viruses and virus-host interactions play important roles in controlling the functions of microbial communities, and viruses have been shown to be associated with complex diseases. Yet investigations into the roles of viruses in human diseases remain significantly underdeveloped.

The objective of this proposal is to develop mathematically rigorous and computationally efficient approaches for highly complex big data, and to apply these approaches to fundamental and important biological and biomedical problems. There are four interrelated aims.

In Aim 1, we will theoretically investigate the power of the recently proposed model-free knockoffs (MFK) procedure, which has been theoretically justified to control FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness of MFK to misspecification of the covariate distribution. These studies will lay the foundations for our developments in the other aims.

In Aim 2, we will develop deep learning approaches to predict viral contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif discovery, and investigate the power and robustness of our new procedure.

In Aim 3, we will take into account virus-host motif interactions and adapt our algorithms and theories from Aim 2 to predict virus-host infectious interaction status.

In Aim 4, we will apply the methods developed in the first three aims to the shotgun metagenomics data sets in ExperimentHub, identifying viruses and virus-host interactions associated with several diseases at a target FDR level. Both the algorithms and the results will be disseminated through the web. The results of this study will be important for metagenomics studies in a variety of environments.
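The abstract names the MFK procedure but does not spell out its steps. As a rough illustration only, the selection step commonly described in the knockoff filter literature (the "knockoff+" threshold) can be sketched as below. The sketch assumes that a feature statistic W_j has already been computed for each covariate, with large positive values suggesting a real signal and a sign that is equally likely to be positive or negative for null covariates. The function names and the example statistics are hypothetical and are not taken from the project.

```python
def knockoff_plus_threshold(w, q):
    """Return the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q,
    or infinity if no such t exists."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        neg = sum(1 for x in w if x <= -t)  # proxy count of false discoveries
        pos = sum(1 for x in w if x >= t)   # covariates that would be selected
        if (1 + neg) / max(1, pos) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of covariates whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics: six strong positives, four near-zero values.
w_example = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
selected = knockoff_select(w_example, q=0.2)
print(selected)
```

Constructing the knockoff covariates and computing the W_j statistics (for example from a fitted deep learning model, as Aim 2 suggests) is the harder part and is deliberately left out of this sketch.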
