
IIBR Informatics: An Efficient Pangenomics Graph Aligner

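Basic information

Award number:
2029552
Principal investigator:
Christina Boucher
Amount:
approx. $700,400
Host institution:
University of Florida
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Period:
2020-09-01 to 2024-08-31
Status:
Completed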

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2029552&HistoricalAwards=false
Keywords:
IIBR Informatics Efficient Pangenomics Graph

Project abstract

In the past decade, there has been an effort to sequence and compare the DNA of a large number of individuals of a given species, resulting in not just a single reference genome but a population of genomes for that species. Enormous public datasets are now available, including the 1000 Genomes Project, the 100,000 Genomes Project, the 1001 Arabidopsis Genomes project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Key software tools, called short read aligners, align newly sequenced DNA fragments to one or more reference genomes in order to identify genetic variation within the species. Downstream analysis of this genetic variation finds causal relationships between complex diseases and phenotypes. Existing short read aligners are unable to align to a large number of reference genomes, purely because of computational constraints. Hence, they align to only a small number of genomes, which reduces the memory and time required. Unfortunately, although individuals of the same species share a large percentage of their genetic material, the differences are also important, and aligning to only a small number of genomes can cause some DNA fragments to align poorly or not at all. This, in turn, makes it more challenging to find genetic variation between the newly sequenced DNA fragments and the reference genomes. One way to overcome this challenge is to develop new algorithms and data structures for short read alignment that reduce the computational resources required. This project realizes this vision by developing a novel representation of a population of genomes and creating the algorithms and data structures needed to build, store, and update it. Integrated into this project are the goals of advancing biological science and knowledge of model species and of furthering the development of an outreach program that supports first-generation university graduates. An immediate outcome of the work will be research opportunities for under-served students through the Machen Florida Opportunity Scholars program, an organization that aims to foster the success of first-generation university scholars.
Short read aligners first build an index from one or more reference genomes and subsequently use it to find and extend matched subsequences between sequence reads and the references. The bottleneck in using these read aligners to index thousands of genomes is the space and time needed to construct and store the index. To address the shortcomings of using a single reference genome, the concept of graph-based pangenomics aligners has been introduced and widely discussed in the community. Although such methods have been shown to improve on the accuracy of standard sequence-based aligners, their use has not been fully explored. The challenge that prevents the realization of pangenomics graph alignment is scalability. The goal of the project is to develop algorithms that allow for the construction of a pangenomic reference from datasets gathered from large populations. To achieve this goal, novel means to build, compress, and update a graph that encapsulates the variation found in the population will be created and implemented. This work will require further advancements that have impact beyond the stated application. More specifically, it is unknown how to merge r-indexes, represent a graph model of references in sublinear space, or represent the graph using the r-index. This project will address these open problems and, more broadly, connect two areas of research: succinct data structures and pangenomics. Next, the project will minimize the conceptual gap between compression and mutability. The research community has struggled to balance compression and mutability, since highly compressed data structures cannot be altered without reconstruction. This poses undue constraints when applying these structures to biological datasets that are routinely updated with new data. This project will make significant developments in this area by developing compressed data structures that are mutable for its realization of the pangenomics index.
Project website: www.christinaboucher.com/pangenomics-iibr
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
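
As background to the abstract's mention of indexes and the r-index: the r-index is built on the run-length encoded Burrows-Wheeler transform (BWT), whose number of runs r tends to stay small when a collection is highly repetitive, as a population of similar genomes is. The following is a minimal Python sketch of that idea, not the project's implementation. The function names (`bwt`, `run_count`, `count_occurrences`) and the toy strings are assumptions made for illustration, and the BWT is built by naively sorting rotations, which is only practical for tiny inputs.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # '$' is a sentinel assumed to sort before every other symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_count(s: str) -> int:
    """Number of maximal runs of equal characters (the 'r' in r-index)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of `pattern` in the indexed text by backward search."""
    # first_row[c] = number of symbols in the text that are smaller than c
    first_row, total = {}, 0
    for c in sorted(set(bwt_str)):
        first_row[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # half-open range of sorted-rotation rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # Naive rank queries; a real index answers these in compressed space.
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


# Toy "population": 20 identical copies of a short hypothetical sequence.
collection = "GATTACA" * 20
transformed = bwt(collection)
print(len(transformed), run_count(transformed))
print(count_occurrences(transformed, "TTAC"))
```

In this toy case the transformed string has 141 symbols but far fewer runs, which is the kind of redundancy a run-length based index can exploit. Real genome collections differ by variants, so r grows with the amount of variation rather than with the number of genomes.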

Project outputs

Journal articles (19)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Computational graph pangenomics: a tutorial on data structures and their applications.

DOI:
10.1007/s11047-022-09882-6
Published:
2022-03
Journal:
NATURAL COMPUTING
Impact factor:
2.1
Authors:
Baaijens, Jasmijn A.;Bonizzoni, Paola;Boucher, Christina;Della Vedova, Gianluca;Pirola, Yuri;Rizzi, Raffaella;Siren, Jouni
Corresponding author:
Siren, Jouni

A Fast and Small Subsampled R-Index

DOI:
10.4230/lipics.cpm.2021.13
Published:
2021
Journal:
Leibniz international proceedings in informatics
Impact factor:
0
Authors:
Cobas, Dustin;Gagie, Travis;Navarro, Gonzalo
Corresponding author:
Navarro, Gonzalo

Efficiently Merging r-indexes

DOI:
10.1109/dcc50243.2021.00028
Published:
2021
Journal:
2021 Data Compression Conference (DCC)
Impact factor:
0
Authors:
Oliva, Marco;Rossi, Massimiliano;Siren, Jouni;Manzini, Giovanni;Kahveci, Tamer;Gagie, Travis;Boucher, Christina
Corresponding author:
Boucher, Christina

On Representing the Degree Sequences of Sublogarithmic-Degree Wheeler Graphs

DOI:
10.1007/978-3-031-20643-6_18
Published:
2022
Journal:
SPIRE
Impact factor:
0
Authors:
T. Gagie
Corresponding author:
T. Gagie

Compressing and Indexing Aligned Readsets

DOI:
10.4230/lipics.wabi.2021.13
Published:
2021
Journal:
Workshop on Algorithms in Bioinformatics (WABI)
Impact factor:
0
Authors:
Gagie, Travis;Gourdel, Garance;Manzini, Giovanni
Corresponding author:
Manzini, Giovanni

Other publications by Christina Boucher

ONeSAMP 3.0: Effective Population Size via SNP Data for One Population Sample

DOI:
Published:
2023
Journal:
bioRxiv
Impact factor:
0
Authors:
Aaron Hong;R. G. Cheek;Kingshuk Mukherjee;Isha Yooseph;Marco Oliva;Mark Heim;W. C. Funk;David Tallmon;Christina Boucher
Corresponding author:
Christina Boucher

Data Structures for SMEM-Finding in the PBWT

DOI:
10.1007/978-3-031-43980-3_8
Published:
2023
Journal:
Theoretical and Applied Genetics
Impact factor:
5.4
Authors:
Paola Bonizzoni;Christina Boucher;D. Cozzi;Travis Gagie;Dominik Köppl;Massimiliano Rossi
Corresponding author:
Massimiliano Rossi

A study at the wildlife-livestock interface unveils the potential of feral swine as a reservoir for extended-spectrum β-lactamase-producing Escherichia coli

DOI:
10.1016/j.jhazmat.2024.134694
Published:
2024-07-15
Journal:
JOURNAL OF HAZARDOUS MATERIALS
Impact factor:
11.300
Authors:
Ting Liu;Shinyoung Lee;Miju Kim;Peixin Fan;Raoul K. Boughton;Christina Boucher;Kwangcheol C. Jeong
Corresponding author:
Kwangcheol C. Jeong