Compact Data Structures for Computational Genomics

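Basic information

  • Grant number:
    RGPIN-2020-07185
  • Principal investigator:
  • Amount:
    $24,800
  • Host institution:
  • Host institution country:
    Canada
  • Program:
    Discovery Grants Program - Individual
  • Fiscal year:
    2021
  • Funding country:
    Canada
  • Start and end dates:
    2021-01-01 to 2022-12-31
  • Project status:
    Completed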

Project abstract

The largest genomic databases now occupy hundreds of terabytes uncompressed. They are an invaluable resource for researchers and physicians but a challenge for computer scientists, because standard tools become impractical at such scales. We can compress these databases extremely well because they are repetitive, but we must still modify our tools to work with their compressed representations.

For example, mapping assembly is a basic task in bioinformatics: indexing a reference genome we have already assembled and then, for each read of a new genome, quickly finding the substrings of the reference it matches most closely. Even when a genome is 99.9% the same as the reference, though, genomes are so big that the 0.1% difference is enough to cause many thousands of reads to remain unmapped. It helps significantly to index a "pan-genome" reference representing many genomes, but popular tools for read-mapping cannot handle databases containing more than a few genomes, because they do not take advantage of the genomes' similarity.

We recently devised a tool called the r-index that can index hundreds or thousands of human genomes using reasonable memory space. We have released a basic practical implementation, which will serve as a starting point for this project. Adding new functionalities to the r-index is more difficult than adding them to current read-mappers, however, because now we need to compress even small auxiliary data structures.

First, we will extend the r-index to find maximal exact matches and combine it with recent software that stores a pan-genome as a variation graph. At the moment that software tries to index a graph, but it can just as well index a genomic database and store a mapping from each genome to a path in the graph. This way, given a read, we first map it to matching substrings in the genomes and then map those to paths in the variation graph.

Second, we will extend the r-index such that, when there are many matches for a read in our genomic database that all map to the same path in the graph, it will return only summary statistics instead of reporting them individually. Such queries are essentially document listing and document counting over highly repetitive collections, something I have previously studied.

Third, we will further investigate Wheeler graphs, a framework for data structures based on the Burrows-Wheeler Transform. Such data structures include the FM-index underlying popular read-mappers; some compact representations of de Bruijn graphs; and variation-graph indexes. Our framework has already inspired a method for further compressing the r-index and for compressed indexing of readsets.

We will work to integrate practical implementations of an extended r-index into commercial and research sequencing pipelines. I am confident this project will help computer scientists meet some of the challenges posed by the flood of genomic data, and thus help biologists and physicians realize its potential.
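The abstract's central observation, that collections of similar genomes are repetitive and that Burrows-Wheeler-based indexes can exploit this, can be illustrated with a toy sketch. The code below is not the r-index implementation the abstract mentions. It is a minimal example, assuming a naive suffix sort and a made-up 15-base sequence, that builds the Burrows-Wheeler Transform of a small collection, counts its runs of equal symbols (the quantity r to which the r-index's space is proportional), and counts pattern occurrences with FM-index-style backward search. All function names (`bwt`, `count_runs`, `backward_search`, `mutate`) are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; only suitable for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    # Assumes text ends with a unique sentinel "$" that sorts before A, C, G, T.
    # For the suffix starting at 0, text[-1] is the sentinel, as required.
    return "".join(text[i - 1] for i in suffix_array(text))


def count_runs(s):
    # Number of maximal runs of equal symbols, e.g. "AAABBA" has 3 runs.
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def backward_search(bwt_str, pattern):
    # Count occurrences of pattern in the original text using only its BWT.
    # C[c] = number of symbols in the text that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    def rank(c, i):
        # Occurrences of c in bwt_str[:i]; a real index answers this in
        # compressed space instead of scanning.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


def mutate(seq, pos, base):
    # Return a copy of seq with a single substitution at position pos.
    return seq[:pos] + base + seq[pos + 1:]


if __name__ == "__main__":
    reference = "GATTACAGATTACCA"  # made-up toy sequence, not real data
    genomes = [reference] * 6 + [mutate(reference, 4, "G"),
                                 mutate(reference, 10, "C")]
    text = "".join(genomes) + "$"
    transformed = bwt(text)
    print("text length:", len(text))
    print("BWT runs:   ", count_runs(transformed))
    print("occurrences of 'TTAC':", backward_search(transformed, "TTAC"))
```

Running the script on collections with more or fewer near-identical copies shows how the run count grows far more slowly than the text length, which is the property a run-length-compressed index relies on.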

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Gagie, Travis

Bidirectional Variable-Order de Bruijn Graphs
Compressing and Indexing Aligned Readsets
SPUMONI 2: improved classification using a pangenome index of minimizer digests.
  • DOI:
    10.1186/s13059-023-02958-1
  • Published:
    2023-05-18
  • Journal:
  • Impact factor:
    12.3
  • Authors:
    Ahmed, Omar Y.;Rossi, Massimiliano;Gagie, Travis;Boucher, Christina;Langmead, Ben
  • Corresponding author:
    Langmead, Ben
New algorithms on wavelet trees and applications to information retrieval
  • DOI:
    10.1016/j.tcs.2011.12.002
  • Published:
    2012-04-06
  • Journal:
  • Impact factor:
    1.1
  • Authors:
    Gagie, Travis;Navarro, Gonzalo;Puglisi, Simon J.
  • Corresponding author:
    Puglisi, Simon J.
Refining the r-index
  • DOI:
    10.1016/j.tcs.2019.08.005
  • Published:
    2020-04-06
  • Journal:
  • Impact factor:
    1.1
  • Authors:
    Bannai, Hideo;Gagie, Travis;Tomohiro, I
  • Corresponding author:
    Tomohiro, I

Other grants held by Gagie, Travis

Compact Data Structures for Computational Genomics
  • Grant number:
    RGPIN-2020-07185
  • Fiscal year:
    2022
  • Funding amount:
    $24,800
  • Program:
    Discovery Grants Program - Individual
Compact Data Structures for Computational Genomics
  • Grant number:
    RGPIN-2020-07185
  • Fiscal year:
    2020
  • Funding amount:
    $24,800
  • Program:
    Discovery Grants Program - Individual
Compact Data Structures for Computational Genomics
  • Grant number:
    DGECR-2020-00311
  • Fiscal year:
    2020
  • Funding amount:
    $24,800
  • Program:
    Discovery Launch Supplement
PGSA
  • Grant number:
    231836-2000
  • Fiscal year:
    2001
  • Funding amount:
    $24,800
  • Program:
    Postgraduate Scholarships
PGSA/ESA
  • Grant number:
    231836-2000
  • Fiscal year:
    2000
  • Funding amount:
    $24,800
  • Program:
    Postgraduate Scholarships

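Similar grants from the National Natural Science Foundation of China

Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
  • Grant number:
  • Year approved:
    2024
  • Funding amount:
  • Program:
    Collaborative Innovation Research Team
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
  • Grant number:
  • Year approved:
    2024
  • Funding amount:
  • Program:
    Research Fund for International Young Scientists
Development of a Linear Stochastic Model for Wind Field Reconstruction from Limited Measurement Data
  • Grant number:
  • Year approved:
    2020
  • Funding amount:
    CNY 400,000
  • Program:
Key Technologies for Semantic Interoperability of Web Services Based on Linked Open Data
  • Grant number:
    61373035
  • Year approved:
    2013
  • Funding amount:
    CNY 770,000
  • Program:
    General Program
Molecular Interaction Reconstruction of Rheumatoid Arthritis Therapies Using Clinical Data
  • Grant number:
    31070748
  • Year approved:
    2010
  • Funding amount:
    CNY 340,000
  • Program:
    General Program
Functional Data Analysis Methods for High-Dimensional Data
  • Grant number:
    11001084
  • Year approved:
    2010
  • Funding amount:
    CNY 160,000
  • Program:
    Young Scientists Fund
The Role of datA, a Negative Regulator of Chromosome Replication, in the Cell Cycle
  • Grant number:
    31060015
  • Year approved:
    2010
  • Funding amount:
    CNY 250,000
  • Program:
    Regional Science Fund
Computational Methods for Analyzing Toponome Data
  • Grant number:
    60601030
  • Year approved:
    2006
  • Funding amount:
    CNY 170,000
  • Program:
    Young Scientists Fund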

Similar overseas grants

CAREER: Data Structures and Streaming Algorithms
  • Grant number:
    2339942
  • Fiscal year:
    2024
  • Funding amount:
    $24,800
  • Program:
    Continuing Grant
Deep Learning for 3-D reconstruction of heterogeneous molecular structures from Cryo-EM data
  • Grant number:
    BB/Y513878/1
  • Fiscal year:
    2024
  • Funding amount:
    $24,800
  • Program:
    Research Grant
SHF: Small: Modular Automated Verification of Concurrent Data Structures
  • Grant number:
    2304758
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Standard Grant
Fully automated protein NMR assignments and structures from raw time-domain data by deep learning
  • Grant number:
    23K05660
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Grant-in-Aid for Scientific Research (C)
Enhancement of low-dimensional embedding methods for complex structures in spatiotemporal data with their applications
  • Grant number:
    23K11018
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Grant-in-Aid for Scientific Research (C)
CAREER: Learning of graph diffusion and transport from high dimensional data with low-dimensional structures
  • Grant number:
    2237842
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Continuing Grant
CRII:OAC:A Data-Driven Closed-Loop Platform for Optimal Design of Deployable Pin-Jointed Structures
  • Grant number:
    2335692
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Standard Grant
Collaborative Research: DMREF: Data-Driven Prediction of Hybrid Organic-Inorganic Structures
  • Grant number:
    2323547
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Continuing Grant
IIBR Informatics: Mixture model algorithms for inferring covariance structures and microbial associations from microbiome data
  • Grant number:
    2400009
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Standard Grant
Graphical Modeling of High-Dimensional Functional Data: Separability Structures and Unified Methodology under General Observational Designs
  • Grant number:
    2310943
  • Fiscal year:
    2023
  • Funding amount:
    $24,800
  • Program:
    Standard Grant