
Causal and integrative deep learning for Alzheimer's disease genetics

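Basic Information

Grant number:
10267373
Principal investigator:
Wei Pan
Amount:
$733,400
Host institution:
UNIVERSITY OF MINNESOTA
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-09-15 to 2026-08-31
Project status:
Ongoing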

Source:
https://reporter.nih.gov/project-details/10267373
Keywords:
Algorithms Alzheimer's Disease Alzheimer's disease risk Biological Brain Brain region Communities Complex Computer software DNA Methylation Data Data Analyses Data Set Disease Documentation Early Diagnosis Epigenetic Process Etiology Gene Expression Gene Proteins Genetic Genetic Diseases Genome Genomics Goals Image Influentials Intervention Knowledge Least-Squares Analysis Linear Models Linear Regressions Methods Modeling Molecular Molecular Target Motivation Neural Network Simulation Non-linear Models Operant Conditioning Outcome Prevention Protein Region Proteomics Public Domains Publishing Pythons Research Risk Risk Factors Sampling Systems Analysis Technology TensorFlow Therapeutic Intervention Time Tweens base causal variant cognitive system computerized tools deep learning deep neural network drug development endophenotype epigenomics flexibility functional genomics genetic association genome sequencing genome wide association study genome-wide genomic data improved insight interest learning strategy machine learning method modifiable risk molecular imaging neural network neuroimaging novel phenotypic data pleiotropism predictive modeling programs protective factors response software development statistical and machine learning therapeutic development therapy development trait transcriptome transcriptomics whole genome

Project Abstract

In response to PAR-19-269, “Cognitive Systems Analysis of Alzheimer's Disease Genetic and Phenotypic Data”, we propose developing and applying more powerful and robust machine learning methods for causal and integrative analysis, especially deep learning approaches for instrumental variable analysis, to identify causal risk/protective factors for Alzheimer's disease (AD) in the post-GWAS era by leveraging published large-scale GWAS, whole-genome sequencing (WGS) and other omic and neuroimaging data. Our main motivation is to extend an emerging and increasingly influential approach of integrating GWAS with gene expression data, called transcriptome-wide association studies (TWAS), aiming to improve over the current practice of GWAS by not only increasing statistical power, but also identifying (putative) causal genes, thus gaining insights into the genetic basis of common diseases and complex traits. The statistical principle underlying TWAS is the (two-sample) two-stage least squares (2SLS) for linear models in the framework of instrumental variable (IV) analysis for causal inference. In practice, however, TWAS may fail to identify true causal genes while giving false positives due to the violation of its modeling assumptions, e.g., due to non-linear effects of IVs or gene expression, or due to invalid IVs (in the presence of horizontal pleiotropy of SNPs). First, we propose developing linear models and neural network models incorporating a large number of functional annotations on the genome (e.g., various types of functional genomic and epigenetic data from the ENCODE and Roadmap Epigenomics projects) as prior knowledge to improve imputing/predicting gene expression (or other molecular or imaging endophenotypes or complex traits/diseases) via SNPs, corresponding to the first stage of 2SLS.
Second, we propose neural networks as more flexible non-linear models for the second stage of 2SLS in the presence of invalid IVs, which may be the SNPs having direct (or horizontal pleiotropic) effects on the outcome as expected from the widespread pleiotropy. Then we combine the approaches in the above two stages to form a more flexible and robust neural network approach as an extension of 2SLS for causal inference. Third, we consider inferring causal directions between two traits, e.g., a gene's expression and AD, allowing non-linear relationships between SNPs and traits and between the two traits. This is critical in reducing false positives, e.g., due to reverse causation, but has been largely under-studied. Fourth, we apply the new (and existing) methods to transcriptomic, proteomic, neuroimaging and AD GWAS/WGS data to identify (putative) causal genes, proteins and brain regions of interest (ROIs) for AD, while building the corresponding genetic prediction models for endophenotypes and AD risk. Finally, we will develop and disseminate publicly available software implementing the proposed analysis methods, e.g., as Python programs or R packages, to facilitate wide use by the scientific community.
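
The two-sample 2SLS principle that the abstract names as the basis of TWAS can be illustrated with a small simulation. The sketch below is not the project's software: the sample sizes, the SNP weights (`SNP_WEIGHTS`), the effect size (`TRUE_EFFECT`) and the helper names `simulate` and `ols` are all hypothetical choices made for illustration, and the simulation assumes the idealized case of linear effects and valid IVs (no horizontal pleiotropy), i.e. exactly the assumptions the proposal aims to relax.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.5                         # hypothetical causal effect of expression on the trait
SNP_WEIGHTS = np.linspace(0.2, 0.5, 10)   # hypothetical SNP effects on gene expression

def simulate(n):
    """Simulate genotypes G, expression X and trait Y sharing an unobserved confounder U."""
    G = rng.binomial(2, 0.3, size=(n, SNP_WEIGHTS.size)).astype(float)
    U = rng.normal(size=n)
    X = G @ SNP_WEIGHTS + U + rng.normal(size=n)
    Y = TRUE_EFFECT * X + 2.0 * U + rng.normal(size=n)
    return G, X, Y

def ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, slope_1, ...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Sample 1 plays the role of a reference panel with genotypes and expression.
G_ref, X_ref, _ = simulate(5_000)
# Sample 2 plays the role of a GWAS cohort; X_gwas is kept only for the naive comparison.
G_gwas, X_gwas, Y_gwas = simulate(20_000)

# Stage 1: learn to predict expression from SNPs in the reference sample.
stage1 = ols(G_ref, X_ref)
# Stage 2: impute expression in the GWAS sample and regress the trait on it.
X_hat = stage1[0] + G_gwas @ stage1[1:]
beta_2sls = ols(X_hat, Y_gwas)[1]

# Naive regression of the trait on measured expression is confounded by U.
beta_naive = ols(X_gwas, Y_gwas)[1]

print(f"2SLS estimate:  {beta_2sls:.3f}")
print(f"naive estimate: {beta_naive:.3f}")
```

Under these assumptions the stage-2 slope should land near the simulated effect, while the naive slope is pulled away from it by the confounder. Adding a direct SNP-to-trait term in `simulate` would mimic the invalid-IV setting described above and would be expected to bias the plain 2SLS estimate.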
