Scaling up computational genomics with tree sequences

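Basic information

Project number: 10585745
Principal investigator: Peter Lochhead Ralph
Award amount: about $605,700
Institution: University of Oregon
Institution country: United States
Project type:
Fiscal year: 2023
Funding country: United States
Project period: 2023-06-05 to 2027-03-31
Project status: Ongoing (not yet concluded)

Source: https://reporter.nih.gov/project-details/10585745
Keywords: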
Address; Affect; Agriculture; Algorithms; Architecture; Area; Astronomy; Base Sequence; Collection; Communities; Complex; Computer software; Computing Methodologies; Culicidae; Data; Data Compression; Data Set; Development; Disease; Ecology; Ensure; Epidemiology; Etiology; Evolution; Genealogical Tree; Generations; Genetic; Genetic Processes; Genetic Recombination; Genetic Variation; Genome; Genomics; Genotype; Goals; Haplotypes; Health Benefit; Historical Demography; Human Genetics; Human Genome; Individual; Internet; Learning; Libraries; Maps; Mathematics; Methods; Modeling; Modernization; Mutation; Performance; Phase; Phenotype; Population; Population Genetics; Population Sizes; Positioning Attribute; Process; Production; Records; Research; Running; Sample Size; Sampling; Statistical Data Interpretation; Structure; Techniques; Testing; Time; Training; Trees; Validation; Variant; Work; algorithm development; computer framework; cost; data format; data structure; deep learning; design; frontier; genome-wide; genomic data; human disease; improved; interest; interoperability; learning strategy; member; multicore processor; next generation; novel strategies; open source; operation; scale up; sequence learning; simulation; statistics; success; supervised learning; whole genome

Project summary

Increasing sample size is a tremendously important factor in building our understanding of the genetics of human disease. As we discover that more and more diseases have a complex web of genetic causation, we need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies. Driven in part by this need, the community is now assembling vast collections of human genome sequences, and millions of samples will soon be commonplace. There is a profound problem, however: our computational methods for storing, processing, and analyzing genomic data are lagging far behind. The algorithms and data structures underlying today’s computational methods were designed for thousands of samples, not millions. Without fundamental change in how we store and process genomic data, we will either not fully tap the potential of the data we collect, or the computational costs will be astronomical, or both.

Nonhuman datasets, with applications in epidemiology, ecology, evolution, and agriculture, may not reach these sample sizes soon, but here we nevertheless face a related barrier. Simulation is increasingly important for tasks from hypothesis generation to parameter inference. However, current simulation methods only scale to tens or hundreds of thousands of individuals, which is inadequate for many species of interest (e.g., mosquitoes). This is crucial, since evolution and ecology in large populations differ from those in small ones, in ways that cannot be avoided by mathematical tricks (like rescaling).

Our proposal addresses these critical needs by focusing on a new data structure: the “tree sequence”, which encodes genetic variation data using the population genetics processes that produced the data itself, by representing variation among contemporary samples using the underlying genealogical trees. This yields extraordinary levels of data compression, with file sizes hundreds of times smaller than current community standards.
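
To make the idea of encoding variation through genealogical trees concrete, here is a minimal toy sketch in plain Python. It is not the project's software or any library's API: the `Edge` and `Mutation` records, the four-sample genealogy, and the helper names are all hypothetical, chosen only for illustration. Each edge says that a child node inherits from a parent node over a genomic interval, and each mutation is stored once on a node; the genotypes are recovered by walking down the local tree.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    left: float    # start of the genomic interval (inclusive)
    right: float   # end of the genomic interval (exclusive)
    parent: int    # ancestral node
    child: int     # descendant node


class Mutation(NamedTuple):
    position: float
    node: int      # the mutation sits on the branch above this node


# Hypothetical toy data: 4 samples, a genome of length 10,
# and one recombination breakpoint at position 5.
SAMPLES = [0, 1, 2, 3]
EDGES = [
    Edge(0, 10, 4, 0), Edge(0, 10, 4, 1),  # 0 and 1 share ancestor 4 everywhere
    Edge(0, 5, 5, 2), Edge(0, 5, 5, 3),    # left tree: 2 and 3 share ancestor 5
    Edge(0, 5, 6, 4), Edge(0, 5, 6, 5),    # left tree: root is node 6
    Edge(5, 10, 5, 4), Edge(5, 10, 5, 2),  # right tree: 4 and 2 share ancestor 5
    Edge(5, 10, 6, 5), Edge(5, 10, 6, 3),  # right tree: root is node 6
]
MUTATIONS = [Mutation(2.0, 5), Mutation(3.0, 4), Mutation(7.0, 5), Mutation(8.0, 3)]


def carriers(mutation, edges, samples):
    """Return the set of samples that inherit `mutation`."""
    # Keep only the edges of the local tree covering the mutation's position.
    children = defaultdict(list)
    for e in edges:
        if e.left <= mutation.position < e.right:
            children[e.parent].append(e.child)
    # Walk down from the mutated node; every sample reached carries the allele.
    sample_set = set(samples)
    found, stack = set(), [mutation.node]
    while stack:
        node = stack.pop()
        if node in sample_set:
            found.add(node)
        stack.extend(children.get(node, []))
    return found


def genotype_matrix(mutations, edges, samples):
    """Decode one 0/1 row per mutation, one column per sample."""
    rows = []
    for m in mutations:
        c = carriers(m, edges, samples)
        rows.append([1 if s in c else 0 for s in samples])
    return rows


for mutation, row in zip(MUTATIONS, genotype_matrix(MUTATIONS, EDGES, SAMPLES)):
    print(mutation.position, row)
```

Note that the two mutations on node 5 decode to different sets of carriers because the local tree changes at the breakpoint. A toy this small does not show any compression; the savings described above would come from storing shared ancestry once rather than once per sample, which only pays off with many samples.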
Since the tree sequence was introduced in 2016, it has led to performance increases of 2–4 orders of magnitude in genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in computational performance are vanishingly rare, and only possible through deep algorithmic advances.

Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our development of highly efficient tree-sequence-based methods for fundamental operations in statistical and population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into complex forward-time simulations and utilizing modern multicore processors. Third, we will combine efficient simulations and the rich information contained in the tree sequence with cutting-edge deep-learning techniques to develop new inference methods. Together, we aim to revolutionize the way we work with and learn from population genetic variation data.
