
Computational Methods for Next-Generation GWAS

Basic Information

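Grant number:
9910009
Principal investigator:
Christopher J Battey
Award amount:
About $19,300
Host institution:
UNIVERSITY OF OREGON
Host institution country:
United States
Project category:
Fiscal year:
2020
Funding country:
United States
Project period:
2020-05-01 to 2020-07-31
Project status:
Completed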

Source:
https://reporter.nih.gov/project-details/9910009
Keywords:
Agriculture; Benchmarking; Biology; Breeding; Communities; Computer software; Computing Methodologies; Coupled; Culicidae; DNA Sequence; Data; Dimensions; Environment; Evolution; Gene Frequency; Genetic; Genotype; Geographic Locations; Goals; Guidelines; Haplotypes; Health; Heart Diseases; Height; Human; Image; Learning; Linear Models; Linear Regressions; Link; Machine Learning; Measures; Methodology; Methods; Modeling; Non-Insulin-Dependent Diabetes Mellitus; Oligogenic Traits; Output; Performance; Phenotype; Polygenic Traits; Population; Population Genetics; Population Heterogeneity; Positioning Attribute; Process; Public Health; Running; Sampling; Signal Transduction; Spatial Distribution; Stratification; Structure; Sum; Techniques; Testing; Training; Trans-Omics for Precision Medicine; Variant; autoencoder; base; biobank; cohort; deep learning; deep neural network; diverse data; experience; genome wide association study; genome-wide; genomic data; human data; image reconstruction; improved; large scale simulation; learning strategy; machine learning algorithm; neural network; next generation; polygenic risk score; population stratification; simulation; statistics; supervised learning; tool; trait

Project Summary

Predicting phenotypes from DNA sequence variation is a major goal for genetics, with potential applications in evolutionary biology, crop breeding, and public health. A central challenge in this task is separating genetic and environmental effects on phenotypes. In natural populations, breeding structure is often correlated with the environment across space, such that different subpopulations experience different environments. For genome-wide association studies (GWAS) this creates a problem: genetic and environmental effects can be confounded by population structure, leading to inflated test statistics and low predictive power across populations (Bulik-Sullivan et al., 2015; Mathieson and McVean, 2012). Understanding when association studies are biased by population stratification, and creating better methods to correct for it, are thus important challenges for population genetics over the next decade.
To identify conditions under which existing methods of population stratification correction are subject to bias, and to develop robust new alternatives suitable for use with the continental-scale genomic datasets that are now routinely available for humans, we propose to use simulations and machine learning to separate the signals of fine-scale ancestry from polygenic phenotype association. In our first aim, we will develop simulations of polygenic phenotype evolution in continuous space and use the output to evaluate existing methods of stratification control, including linear mixed models, PC correction, and LD score regression. In this aim we will seek to identify the regions of parameter space – i.e., the strength of isolation by distance and the spatial distribution of environmental variation – in which existing methods can be expected to produce reliable effect size estimates, and establish guidelines for applications of GWAS to structured populations.
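
The confounding described for the first aim can be illustrated with a toy simulation. The sketch below is not the project's simulation framework: the abstract describes continuous-space simulations, whereas this example assumes two discrete subpopulations, no causal SNPs at all, and a known subpopulation label used as a covariate as a simple stand-in for PC correction. All function names and parameter values are hypothetical.

```python
import random
from statistics import fmean


def simulate(n_per_pop=150, n_snps=200, env_shift=1.0, seed=1):
    """Two subpopulations with diverged allele frequencies and different
    environments. No SNP affects the phenotype (assumed toy model)."""
    rng = random.Random(seed)
    # Per-SNP allele frequencies in each subpopulation, kept within [0.05, 0.95].
    freqs = []
    for _ in range(n_snps):
        base = rng.uniform(0.2, 0.8)
        delta = rng.uniform(-0.15, 0.15)
        freqs.append((base - delta, base + delta))
    genotypes, phenotypes, labels = [], [], []
    for pop in (0, 1):
        for _ in range(n_per_pop):
            # Diploid genotype: number of alternate alleles (0, 1, or 2).
            row = [(rng.random() < f[pop]) + (rng.random() < f[pop]) for f in freqs]
            genotypes.append(row)
            # Phenotype is purely environmental: the mean differs by subpopulation.
            phenotypes.append(rng.gauss(pop * env_shift, 1.0))
            labels.append(float(pop))
    return genotypes, phenotypes, labels


def residualize(values, covariate):
    """Remove the least-squares fit of covariate (plus intercept) from values.
    Assumes the covariate is not constant."""
    mv, mc = fmean(values), fmean(covariate)
    sxx = sum((c - mc) ** 2 for c in covariate)
    beta = sum((c - mc) * (v - mv) for c, v in zip(covariate, values)) / sxx
    return [v - mv - beta * (c - mc) for c, v in zip(covariate, values)]


def chi2_stat(x, y):
    """Score-type association statistic n * r^2, roughly chi-square
    with 1 degree of freedom when x and y are unrelated."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return n * sxy * sxy / (sxx * syy)


def mean_chi2(genotypes, phenotypes, covariate=None):
    """Mean association statistic across SNPs, optionally adjusting both
    genotype and phenotype for a covariate."""
    y = phenotypes if covariate is None else residualize(phenotypes, covariate)
    stats = []
    for j in range(len(genotypes[0])):
        x = [row[j] for row in genotypes]
        if covariate is not None:
            x = residualize(x, covariate)
        stats.append(chi2_stat(x, y))
    return fmean(stats)


if __name__ == "__main__":
    g, y, labels = simulate()
    print("mean chi2, unadjusted:", round(mean_chi2(g, y), 2))
    print("mean chi2, adjusted:  ", round(mean_chi2(g, y, covariate=labels), 2))
```

Because no SNP is causal here, a mean statistic well above 1 in the unadjusted analysis reflects stratification alone, and adjusting for the subpopulation label should bring it back to about 1. With continuous spatial structure there is no such clean label, which is the harder case the abstract is concerned with.
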
We will then train machine learning algorithms on real genotype data from humans and mosquitoes to describe continuous structure in large spatial samples using a variational autoencoder, a dimensionality reduction technique based on deep neural networks. This technique can take advantage of both allele frequency and haplotype-based measures of differentiation in a single analysis and can thus offer improved control of stratification inflation in GWAS relative to the now-standard PCA regression approach. Last, we will apply deep learning techniques to the problem of linking phenotypes and genotypes in structured samples by training neural networks on simulated phenotypes and empirical genetic data. By training our networks on empirical genetic data and incorporating contextual information about surrounding haplotype structure into the model, our networks should learn to discriminate causal associations from false positives created by population structure in the sample cohort, which will improve performance when attempting to identify associations with the real phenotype. These methods will be applied to existing genomic datasets of height in humans, tested against the current state-of-the-art approaches, and packaged as scalable software for the broader scientific community.
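
Training on simulated phenotypes means the causal variants are known by construction, so each SNP can be labeled as causal or not. The abstract does not specify the phenotype model, so the following is only a minimal sketch under assumed choices: an additive model, Gaussian effect sizes, and noise scaled to a target heritability. The `toy_genotypes` helper is a hypothetical placeholder for empirical genotype data, and all names and parameter values are illustrative.

```python
import random


def toy_genotypes(n_individuals, n_snps, seed=0):
    """Hypothetical placeholder for empirical genotypes: independent SNPs
    coded as alternate-allele counts (0, 1, or 2)."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    return [
        [(rng.random() < f) + (rng.random() < f) for f in freqs]
        for _ in range(n_individuals)
    ]


def simulate_phenotype(genotypes, n_causal=10, heritability=0.5, seed=0):
    """Additive phenotype on fixed genotypes with known causal SNPs.

    Returns (phenotypes, effects, genetic_values), where effects maps each
    causal SNP index to its effect size. These indices are the labels a
    classifier could be trained against."""
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    rng = random.Random(seed)
    n_snps = len(genotypes[0])
    causal = sorted(rng.sample(range(n_snps), n_causal))
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}

    # Genetic value: weighted sum of allele counts at the causal SNPs, centered.
    raw = [sum(effects[j] * row[j] for j in causal) for row in genotypes]
    mean_g = sum(raw) / len(raw)
    genetic_values = [v - mean_g for v in raw]
    var_g = sum(v * v for v in genetic_values) / len(genetic_values)
    if var_g == 0:
        raise ValueError("causal SNPs have no variation in this sample")

    # Scale the noise so that var_g / (var_g + var_e) equals the target heritability.
    noise_sd = (var_g * (1 - heritability) / heritability) ** 0.5
    phenotypes = [v + rng.gauss(0.0, noise_sd) for v in genetic_values]
    return phenotypes, effects, genetic_values


if __name__ == "__main__":
    g = toy_genotypes(2000, 50, seed=1)
    y, effects, _ = simulate_phenotype(g, n_causal=5, heritability=0.5, seed=2)
    print("causal SNP indices:", sorted(effects))
```

Repeating this over many seeds on the same genotype matrix yields many phenotype replicates with known causal sets while preserving whatever population structure the genotypes contain, which is the property the abstract relies on.
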