Scaling up computational genomics with tree sequences
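Basic information
- Grant number: 10585745
- Principal investigator:
- Amount: $605,700
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-06-05 to 2027-03-31
- Project status: Ongoing
- Source: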
- Keywords: Address; Affect; Agriculture; Algorithms; Architecture; Area; Astronomy; Base Sequence; Collection; Communities; Complex; Computer software; Computing Methodologies; Culicidae; Data; Data Compression; Data Set; Development; Disease; Ecology; Ensure; Epidemiology; Etiology; Evolution; Genealogical Tree; Generations; Genetic; Genetic Processes; Genetic Recombination; Genetic Variation; Genome; Genomics; Genotype; Goals; Haplotypes; Health Benefit; Historical Demography; Human Genetics; Human Genome; Individual; Internet; Learning; Libraries; Maps; Mathematics; Methods; Modeling; Modernization; Mutation; Performance; Phase; Phenotype; Population; Population Genetics; Population Sizes; Positioning Attribute; Process; Production; Records; Research; Running; Sample Size; Sampling; Statistical Data Interpretation; Structure; Techniques; Testing; Time; Training; Trees; Validation; Variant; Work; algorithm development; computer framework; cost; data format; data structure; deep learning; design; frontier; genome-wide; genomic data; human disease; improved; interest; interoperability; learning strategy; member; multicore processor; next generation; novel strategies; open source; operation; scale up; sequence learning; simulation; statistics; success; supervised learning; whole genome
Project Summary/Abstract
Increasing sample size is a tremendously important factor in building our understanding of the genetics of human disease. As we discover that more and more diseases have a complex web of genetic causation, we need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies. Driven in part by this need, the community is now assembling vast collections of human genome sequences, and millions of samples will soon be commonplace. There is a profound problem, however: our computational methods for storing, processing, and analyzing genomic data are lagging far behind. The algorithms and data structures underlying today’s computational methods were designed for thousands of samples, not millions. Without fundamental change in how we store and process genomic data, we will either not fully tap the potential of the data we collect, or the computational costs will be astronomical – or both.
Nonhuman datasets, with applications in epidemiology, ecology, evolution, and agriculture, may not reach these sample sizes soon, but here we nevertheless face a related barrier. Simulation is increasingly important for tasks from hypothesis generation to parameter inference. However, current simulation methods only scale to tens or hundreds of thousands of individuals, which is too few for many species of interest (e.g., mosquitoes). This is crucial, since evolution and ecology in large populations differ from those in small ones, in ways that cannot be avoided by mathematical tricks (like rescaling).
Our proposal addresses these critical needs by focusing on a new data structure: the “tree sequence”, which encodes genetic variation data using the population genetics processes that produced the data itself, by representing variation among contemporary samples using the underlying genealogical trees. This yields extraordinary levels of data compression, with file sizes hundreds of times smaller than current community standards. Since the tree sequence was introduced in 2016 it has led to performance increases of 2–4 orders of magnitude in genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in computational performance are vanishingly rare, and only possible through deep algorithmic advances.
Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our development of highly efficient tree-sequence-based methods for fundamental operations in statistical and population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into complex forward-time simulations and utilizing modern, multicore processors. Third, we will combine efficient simulations and the rich information contained in the tree sequence with cutting-edge deep-learning techniques to develop new inference methods. Together, these efforts aim to revolutionize the way we work with and learn from population genetic variation data.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by PETER Lochhead RALPH
Scaling up computational genomics with tree sequences
- Grant number: 10471496
- Fiscal year: 2021
- Funding amount: $605,700
- Project category:
Similar overseas grants
RII Track-4:NSF: From the Ground Up to the Air Above Coastal Dunes: How Groundwater and Evaporation Affect the Mechanism of Wind Erosion
- Grant number: 2327346
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Standard Grant
BRC-BIO: Establishing Astrangia poculata as a study system to understand how multi-partner symbiotic interactions affect pathogen response in cnidarians
- Grant number: 2312555
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Standard Grant
How Does Particle Material Properties Insoluble and Partially Soluble Affect Sensory Perception Of Fat based Products
- Grant number: BB/Z514391/1
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Training Grant
Graduating in Austerity: Do Welfare Cuts Affect the Career Path of University Students?
- Grant number: ES/Z502595/1
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Fellowship
Insecure lives and the policy disconnect: How multiple insecurities affect Levelling Up and what joined-up policy can do to help
- Grant number: ES/Z000149/1
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Research Grant
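Construction of Affect-X, an index of individual differences in sensibility, and establishment of a foundation for bespoke AI services
- Grant number: 23K24936
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Grant-in-Aid for Scientific Research (B)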
How does metal binding affect the function of proteins targeted by a devastating pathogen of cereal crops?
- Grant number: 2901648
- Fiscal year: 2024
- Funding amount: $605,700
- Project category: Studentship
ERI: Developing a Trust-supporting Design Framework with Affect for Human-AI Collaboration
- Grant number: 2301846
- Fiscal year: 2023
- Funding amount: $605,700
- Project category: Standard Grant
Investigating how double-negative T cells affect anti-leukemic and GvHD-inducing activities of conventional T cells
- Grant number: 488039
- Fiscal year: 2023
- Funding amount: $605,700
- Project category: Operating Grants
How motor impairments due to neurodegenerative diseases affect masticatory movements
- Grant number: 23K16076
- Fiscal year: 2023
- Funding amount: $605,700
- Project category: Grant-in-Aid for Early-Career Scientists