Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data

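Basic Information

Grant number:
9753295
Principal investigator:
Yingying Fan
Amount:
About $279,900
Host institution:
UNIVERSITY OF SOUTHERN CALIFORNIA
Host institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-08-01 to 2022-04-30
Project status:
Completed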

Source:
https://reporter.nih.gov/project-details/9753295
Keywords:
Address Algorithms Archaea Attention Bacteria Big Data Biological Bypass Cells Colorectal Cancer Complex Computer software Consult Coupled Data Data Set Development Dimensions Disease Ecosystem Effectiveness Environment Foundations Frequencies Gaussian model Genes Genetic Materials Genomics Healthcare Human Internet Investigation Joints Length Linear Regressions Literature Liver Cirrhosis Machine Learning Marines Mathematics Metagenomics Methods Modeling Modernization Molecular Molecular Sequence Data Mutation Neurosciences Non-Insulin-Dependent Diabetes Mellitus Non-linear Models Obesity Organism Performance Planet Earth Play Procedures Reproducibility Reproducibility of Results Research Research Personnel Role Sampling Sampling Studies Shotguns Social Sciences Testing Theoretical Studies Tissues Training Viral Virus Visualization software Work base biological research computerized tools contig dark matter deep learning deep learning algorithm design flexibility high dimensionality human disease human tissue improved interest learning strategy metagenomic sequencing microbial community microbiome microbiome research model design model development new technology novel power analysis response simulation theories trait user-friendly virus host interaction virus identification

Project Abstract

Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications, such as the recent National Microbiome Initiative (NMI), demand highly flexible statistical machine learning methods that produce both interpretable and reproducible results. It is therefore of paramount importance to identify, from a large number of available covariates, the crucial causal factors responsible for the response, a task that can be formulated statistically as false discovery rate (FDR) control in general high-dimensional nonlinear models.

Despite the wide application of shotgun metagenomic studies, most existing investigations concentrate on bacterial organisms. However, viruses and virus-host interactions play important roles in controlling the functions of microbial communities, and viruses have been shown to be associated with complex diseases. Yet investigations into the roles of viruses in human diseases remain significantly underdeveloped.

The objective of this proposal is to develop mathematically rigorous and computationally efficient approaches for highly complex big data, and to apply these approaches to fundamental and important biological and biomedical problems. There are four interrelated aims.

In Aim 1, we will theoretically investigate the power of the recently proposed model-free knockoffs (MFK) procedure, which has been theoretically justified to control FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness of MFK with respect to misspecification of the covariate distribution. These studies will lay the foundations for our developments in the other aims.

In Aim 2, we will develop deep learning approaches to predict viral contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif discovery, and investigate the power and robustness of our new procedure.

In Aim 3, we will take into account virus-host motif interactions and adapt our algorithms and theories from Aim 2 to predict virus-host infectious interaction status.

In Aim 4, we will apply the methods developed in the first three aims to analyze the shotgun metagenomics data sets in ExperimentHub, identifying viruses and virus-host interactions associated with several diseases at a target FDR level.

Both the algorithms and the results will be disseminated through the web. The results from this study will be important for metagenomics studies under a variety of environments.
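To make the idea of selecting variables "at some target FDR level" concrete, the following is a minimal sketch of the knockoff+ selection rule that is commonly paired with knockoff statistics. It is an illustration under stated assumptions, not the project's implementation: the function names and the toy statistics `w` are hypothetical, and the sketch assumes one statistic W_j per covariate has already been computed, where a large positive W_j suggests the original covariate matters more than its knockoff copy. The rule picks the smallest threshold t > 0 such that (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q and selects the covariates with W_j >= t.

```python
def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    Scans candidate thresholds (the distinct nonzero |W_j|) in increasing
    order and returns the first t whose estimated false discovery
    proportion is at most q. Returns infinity if no threshold qualifies.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # covariates selected at t
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Return the indices of covariates selected at target FDR level q."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical toy statistics, for illustration only.
w = [5.0, 4.2, 3.9, 3.1, 2.8, -0.4, 0.3, -0.2, 2.5, 0.1]
selected = knockoff_select(w, q=0.2)
```

With these toy values the first qualifying threshold is 2.5, so the six covariates with the largest positive statistics are selected. At a stricter level such as q = 0.1 no threshold qualifies and nothing is selected, which reflects the "+1" correction in the numerator: the rule needs enough positive statistics before it can make any discovery at a small q.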
