Compact Data Structures for Computational Genomics
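Basic information
- Grant number: RGPIN-2020-07185
- Principal investigator:
- Amount: $24,800
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2020
- Funding country: Canada
- Period: 2020-01-01 to 2021-12-31
- Status: Completed
- Source:
- Keywords:
Project abstract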
The largest genomic databases now occupy hundreds of terabytes uncompressed. They are an invaluable resource for researchers and physicians but a challenge for computer scientists because standard tools become impractical at such scales. We can compress these databases extremely well because they are repetitive, but we must still modify our tools to work with their compressed representations.
For example, mapping assembly is a basic task in bioinformatics: we index a reference genome we have already assembled and then, for each read of a new genome, quickly find the substrings of the reference it matches most closely. Even when a genome is 99.9% the same as the reference, though, genomes are so big that the 0.1% difference is enough to leave many thousands of reads unmapped. It helps significantly to index a "pan-genome" reference representing many genomes, but popular read-mapping tools cannot handle databases containing more than a few genomes, because they do not take advantage of the genomes' similarity.
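To make the indexing step concrete, here is a minimal sketch of exact read lookup with a toy FM-index, the kind of structure the abstract later names as underlying popular read-mappers. It handles exact matches only (real mappers also tolerate mismatches), builds its tables naively, and all names (`build_fm_index`, `locate`) are illustrative assumptions rather than any tool's API.

```python
from collections import Counter


def build_fm_index(reference: str):
    """Build a toy FM-index: suffix array, C table, and Occ table.

    Assumes "$" does not occur in the reference and sorts below every
    other character used (true for uppercase DNA letters in ASCII).
    """
    s = reference + "$"
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(s[i - 1] for i in sa)
    counts = Counter(s)
    # c_table[c] = number of characters in s strictly smaller than c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(s) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (ch == b)
    return sa, c_table, occ


def locate(read: str, index) -> list:
    """Return the sorted start positions of exact occurrences of read."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(read):  # backward search: extend the match leftwards
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(locate("ATTA", index))
```

The uncompressed `occ` table here takes space proportional to the reference length times the alphabet size, which is exactly the kind of cost that becomes prohibitive for collections of many genomes.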
We recently devised a tool called the r-index that can index hundreds or thousands of human genomes using reasonable memory space. We have released a basic practical implementation, which will serve as a starting point for this project. Adding new functionalities to the r-index is more difficult than adding them to current read-mappers, however, because now we need to compress even small auxiliary data structures.
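The r-index takes space proportional to r, the number of runs of equal characters in the Burrows-Wheeler Transform of the collection, and r tends to stay small when the genomes are near-copies of each other. The sketch below illustrates that effect on made-up data; the naive rotation sort, the helper `toy_pangenome`, and its mutation rule are assumptions for illustration only and have nothing to do with the released implementation.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Naive BWT via sorted rotations; "$" is an end marker assumed absent from text."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_count(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


def toy_pangenome(base: str, copies: int) -> str:
    """Concatenate copies of base, substituting one character in each
    copy after the first (hypothetical data, not a real pan-genome)."""
    genomes = [base]
    for k in range(1, copies):
        pos = (7 * k) % len(base)
        mutated = "A" if base[pos] != "A" else "C"
        genomes.append(base[:pos] + mutated + base[pos + 1:])
    return "".join(genomes)


if __name__ == "__main__":
    base = "GATTACAGGCTTAACCGTAGCTAGGATCCAT"
    for copies in (1, 10, 40):
        text = toy_pangenome(base, copies)
        # Columns: number of genomes, text length n (with "$"), BWT runs r.
        print(copies, len(text) + 1, run_count(bwt(text)))
```

On input like this the text length grows linearly with the number of copies while the run count grows much more slowly, which is the property a run-length-based index exploits.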
First, we will extend the r-index to find maximal exact matches and combine it with recent software that stores a pan-genome as a variation graph. At the moment, that software tries to index a graph, but it can just as well index a genomic database and store a mapping from each genome to a path in the graph. This way, given a read, we first map it to matching substrings in the genomes and then map those to paths in the variation graph.
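As an illustration of what a maximal exact match (MEM) is, the brute-force sketch below takes a MEM to be a substring of the read that occurs in the reference and cannot be extended in either direction within the read while still occurring there. It uses plain substring search rather than an r-index, so it shows only the definition, not the proposed method; the function names are assumptions.

```python
def matching_statistics(read: str, ref: str) -> list:
    """ms[i] = length of the longest prefix of read[i:] that occurs in ref."""
    ms = []
    for i in range(len(read)):
        length = 0
        while i + length < len(read) and read[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms


def maximal_exact_matches(read: str, ref: str, min_len: int = 1) -> list:
    """Return (start_in_read, length) for each MEM of length >= min_len.

    A match starting at i with length ms[i] is right-maximal by construction;
    it is left-maximal exactly when i == 0 or ms[i - 1] <= ms[i].
    """
    ms = matching_statistics(read, ref)
    return [
        (i, length)
        for i, length in enumerate(ms)
        if length >= min_len and (i == 0 or ms[i - 1] <= length)
    ]


if __name__ == "__main__":
    ref = "TACGGACGTT"
    print(maximal_exact_matches("ACGAGTT", ref, min_len=2))
```

Each reported pair can then be looked up in the reference to find where the matched substring occurs, which is the information that would be mapped onto graph paths.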
Second, we will extend the r-index such that, when there are many matches for a read in our genomic database that all map to the same path in the graph, it will return only summary statistics instead of reporting them individually. Such queries are essentially document listing and document counting over highly repetitive collections, something I have previously studied.
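The difference between reporting every occurrence and reporting a summary can be sketched as follows. This uncompressed toy groups matches by genome rather than by graph path, scans the text directly, and assumes a separator character `#` that appears in no genome; `summarize_matches` is a hypothetical name, not part of any existing tool.

```python
from bisect import bisect_right
from collections import Counter


def summarize_matches(genomes, pattern, sep="#"):
    """Count occurrences of pattern per genome instead of listing each one.

    Returns a Counter mapping genome index -> number of occurrences.
    """
    if not pattern or sep in pattern:
        raise ValueError("pattern must be non-empty and must not contain sep")
    # starts[k] = offset of genome k in the concatenated text.
    starts, pos = [], 0
    for g in genomes:
        starts.append(pos)
        pos += len(g) + len(sep)
    text = sep.join(genomes) + sep
    per_genome = Counter()
    hit = text.find(pattern)
    while hit != -1:
        # The genome containing offset hit is the last one starting at or before it.
        per_genome[bisect_right(starts, hit) - 1] += 1
        hit = text.find(pattern, hit + 1)  # allow overlapping occurrences
    return per_genome


if __name__ == "__main__":
    genomes = ["GATTACA", "GATTTACA", "CATTAGA"]
    counts = summarize_matches(genomes, "ATTA")
    print("document listing:", sorted(counts))
    print("document count:", len(counts))
```

Here `sorted(counts)` plays the role of document listing and `len(counts)` of document counting; the research question in the abstract is how to answer such queries without enumerating every occurrence and within compressed space.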
Third, we will investigate further Wheeler graphs, a framework for data structures based on the Burrows-Wheeler Transform. Such data structures include the FM-index underlying popular read-mappers; some compact representations of de Bruijn graphs; and variation-graph indexes. Our framework has already inspired a method for further compressing the r-index and for compressed indexing of readsets.
We will work to integrate practical implementations of an extended r-index into commercial and research sequencing pipelines. I am confident this project will help computer scientists meet some of the challenges posed by the flood of genomic data, and thus help biologists and physicians realize its potential.
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Gagie, Travis
Bidirectional Variable-Order de Bruijn Graphs
- DOI: 10.1142/s0129054118430037
- Published: 2018-12-01
- Journal:
- Impact factor: 0.8
- Authors: Belazzougui, Djamal; Gagie, Travis; Puglisi, Simon J.
- Corresponding author: Puglisi, Simon J.
SPUMONI 2: improved classification using a pangenome index of minimizer digests.
- DOI: 10.1186/s13059-023-02958-1
- Published: 2023-05-18
- Journal:
- Impact factor: 12.3
- Authors: Ahmed, Omar Y.; Rossi, Massimiliano; Gagie, Travis; Boucher, Christina; Langmead, Ben
- Corresponding author: Langmead, Ben
Compressing and Indexing Aligned Readsets
- DOI: 10.4230/lipics.wabi.2021.13
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Gagie, Travis; Gourdel, Garance; Manzini, Giovanni
- Corresponding author: Manzini, Giovanni
Refining the r-index
- DOI: 10.1016/j.tcs.2019.08.005
- Published: 2020-04-06
- Journal:
- Impact factor: 1.1
- Authors: Bannai, Hideo; Gagie, Travis; Tomohiro, I
- Corresponding author: Tomohiro, I
New algorithms on wavelet trees and applications to information retrieval
- DOI: 10.1016/j.tcs.2011.12.002
- Published: 2012-04-06
- Journal:
- Impact factor: 1.1
- Authors: Gagie, Travis; Navarro, Gonzalo; Puglisi, Simon J.
- Corresponding author: Puglisi, Simon J.
Other grants held by Gagie, Travis
Compact Data Structures for Computational Genomics
- Grant number: RGPIN-2020-07185
- Fiscal year: 2022
- Amount: $24,800
- Program: Discovery Grants Program - Individual
Compact Data Structures for Computational Genomics
- Grant number: RGPIN-2020-07185
- Fiscal year: 2021
- Amount: $24,800
- Program: Discovery Grants Program - Individual
Compact Data Structures for Computational Genomics
- Grant number: DGECR-2020-00311
- Fiscal year: 2020
- Amount: $24,800
- Program: Discovery Launch Supplement
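Similar grants from the National Natural Science Foundation of China (NSFC)
Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
- Grant number:
- Approval year: 2024
- Amount:
- Program: Collaborative Innovation Research Team
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
- Grant number:
- Approval year: 2024
- Amount:
- Program: Research Fund for International Young Scientists
Development of a Linear Stochastic Model for Wind Field Reconstruction from Limited Measurement Data
- Grant number:
- Approval year: 2020
- Amount: CNY 400,000
- Program:
Key Technologies for Semantic Interoperability of Web Services Based on Linked Open Data
- Grant number: 61373035
- Approval year: 2013
- Amount: CNY 770,000
- Program: General Program
Molecular Interaction Reconstruction of Rheumatoid Arthritis Therapies Using Clinical Data
- Grant number: 31070748
- Approval year: 2010
- Amount: CNY 340,000
- Program: General Program
Functional Data Analysis Methods for High-Dimensional Data
- Grant number: 11001084
- Approval year: 2010
- Amount: CNY 160,000
- Program: Young Scientists Fund
The Role of datA, a Negative Regulator of Chromosome Replication, in the Cell Cycle
- Grant number: 31060015
- Approval year: 2010
- Amount: CNY 250,000
- Program: Regional Science Fund
Computational Methods for Analyzing Toponome Data
- Grant number: 60601030
- Approval year: 2006
- Amount: CNY 170,000
- Program: Young Scientists Fund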
Similar overseas grants
CAREER: Data Structures and Streaming Algorithms
- Grant number: 2339942
- Fiscal year: 2024
- Amount: $24,800
- Program: Continuing Grant
Deep Learning for 3-D reconstruction of heterogeneous molecular structures from Cryo-EM data
- Grant number: BB/Y513878/1
- Fiscal year: 2024
- Amount: $24,800
- Program: Research Grant
SHF: Small: Modular Automated Verification of Concurrent Data Structures
- Grant number: 2304758
- Fiscal year: 2023
- Amount: $24,800
- Program: Standard Grant
Fully automated protein NMR assignments and structures from raw time-domain data by deep learning
- Grant number: 23K05660
- Fiscal year: 2023
- Amount: $24,800
- Program: Grant-in-Aid for Scientific Research (C)
Enhancement of low-dimensional embedding methods for complex structures in spatiotemporal data with their applications
- Grant number: 23K11018
- Fiscal year: 2023
- Amount: $24,800
- Program: Grant-in-Aid for Scientific Research (C)
CAREER: Learning of graph diffusion and transport from high dimensional data with low-dimensional structures
- Grant number: 2237842
- Fiscal year: 2023
- Amount: $24,800
- Program: Continuing Grant
IIBR Informatics: Mixture model algorithms for inferring covariance structures and microbial associations from microbiome data
- Grant number: 2400009
- Fiscal year: 2023
- Amount: $24,800
- Program: Standard Grant
Collaborative Research: DMREF: Data-Driven Prediction of Hybrid Organic-Inorganic Structures
- Grant number: 2323547
- Fiscal year: 2023
- Amount: $24,800
- Program: Continuing Grant
CRII:OAC:A Data-Driven Closed-Loop Platform for Optimal Design of Deployable Pin-Jointed Structures
- Grant number: 2335692
- Fiscal year: 2023
- Amount: $24,800
- Program: Standard Grant
Collaborative Research: DMREF: Data-Driven Prediction of Hybrid Organic-Inorganic Structures
- Grant number: 2323548
- Fiscal year: 2023
- Amount: $24,800
- Program: Continuing Grant