IIBR Informatics: An Efficient Pangenomics Graph Aligner
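Basic information
- Award number: 2029552
- Principal investigator:
- Amount: $700,400
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Project period: 2020-09-01 to 2024-08-31
- Project status: Completed
- Source:
- Keywords:
Project abstract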
Over the past decade, there has been an effort to sequence and compare the DNA of a large number of individuals of a given species, resulting in not just a single reference genome but a population of genomes for that species. Enormous public datasets are now available, including the 1000 Genomes Project, the 100K Genome Project, the 1001 Arabidopsis Genomes project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Key software tools, called short read aligners, align newly sequenced DNA fragments to one or more reference genomes in order to identify genetic variation within the species. The downstream analysis of this genetic variation seeks causal relationships between complex diseases and phenotypes. Existing short read aligners are unable to align to a large number of reference genomes, purely because of computational constraints. Aligning to only a small number of genomes keeps memory use and running time manageable. Unfortunately, although individuals of the same species share a large percentage of their genetic material, the differences are also important, and aligning to only a small number of genomes can lead to some DNA fragments not aligning or aligning poorly. This, in turn, makes finding genetic variation between the newly sequenced DNA fragments and the reference genome(s) more challenging. One way to overcome this challenge is to develop new algorithms and data structures for short read alignment that reduce the computational resources required. This project realizes this vision by developing a novel representation of a population of genomes and creating the algorithms and data structures needed to build, store, and update it. Integrated into this project are the goals of advancing biological science and knowledge of model species and of furthering the development of an outreach program that supports first-generation university graduates.
An immediate outcome of the work will be research opportunities for under-served students through the Machen Florida Opportunity Scholars program, an organization that aims to foster the success of first-generation university scholars. Short read aligners first build an index from one or more reference genomes and subsequently use it to find and extend matching subsequences between sequence reads and the reference(s). The bottleneck in using these read aligners to index thousands of genomes is the space and time needed to construct and store the index. To address the shortcomings of using a single reference genome, the concept of graph-based pangenomic aligners has been introduced and widely discussed in the community. Although such methods have been shown to improve accuracy over standard sequence-based aligners, their use has not been fully explored. The challenge that prevents the realization of pangenomic graph alignment is scalability. The goal of the project is to develop algorithms that allow a pangenomic reference to be constructed from datasets gathered from large populations. To achieve this goal, novel means to build, compress, and update a graph that encapsulates the variation found in the population will be created and implemented. This work will require further advances that have impact beyond the stated application. More specifically, it is unknown how to merge r-indexes, how to represent a graph model of references in sub-linear space, or how to represent the graph using the r-index. This project will address these open problems and, more broadly, connect two areas of research: succinct data structures and pangenomics. Next, the project will narrow the conceptual gap between compression and mutability. The research community has struggled with the balance between compression and mutability, since highly compressed data structures cannot be altered without reconstruction.
This poses undue constraints when trying to apply these structures to biological datasets that are routinely updated with new data. This project will make significant developments in this area by developing mutable compressed data structures for the project's pangenomics index. Project website: www.christinaboucher.com/pangenomics-iibr
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
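The abstract names the r-index without saying why it suits population-scale data. As a rough illustration only: the r-index is built on the Burrows-Wheeler transform (BWT), and its size depends on the number of runs r of equal characters in the BWT rather than on the total text length. The sketch below is not the project's software; `bwt`, `count_runs`, and `toy_population` are hypothetical helper names, the genomes are random toy strings, and the quadratic rotation sort is only suitable for tiny inputs.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def toy_population(n_genomes: int, length: int = 200,
                   n_variants: int = 2, seed: int = 0) -> str:
    """Concatenate n_genomes near-identical copies of one random base genome.

    Each copy receives up to n_variants random substitutions (an assumption
    made for illustration; real populations also contain indels and
    structural variants).
    """
    rng = random.Random(seed)
    base = [rng.choice("ACGT") for _ in range(length)]
    genomes = []
    for _ in range(n_genomes):
        genome = base[:]
        for _ in range(n_variants):
            genome[rng.randrange(length)] = rng.choice("ACGT")
        genomes.append("".join(genome))
    return "".join(genomes)


if __name__ == "__main__":
    # Print genome count, total length n, and BWT run count r.
    for k in (1, 2, 4, 8):
        text = toy_population(k)
        print(k, len(text), count_runs(bwt(text)))
```

On highly repetitive collections, r typically grows much more slowly than the total length, which is the property that run-length-compressed indexes exploit. The exact numbers printed depend on the random toy data and should not be read as benchmarks.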
Project outcomes
Journal articles (19)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Computational graph pangenomics: a tutorial on data structures and their applications.
- DOI: 10.1007/s11047-022-09882-6
- Published: 2022-03
- Journal:
- Impact factor: 2.1
- Authors: Baaijens, Jasmijn A.; Bonizzoni, Paola; Boucher, Christina; Della Vedova, Gianluca; Pirola, Yuri; Rizzi, Raffaella; Siren, Jouni
- Corresponding author: Siren, Jouni
A Fast and Small Subsampled R-Index
- DOI: 10.4230/lipics.cpm.2021.13
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Cobas, Dustin; Gagie, Travis; Navarro, Gonzalo
- Corresponding author: Navarro, Gonzalo
Efficiently Merging r-indexes
- DOI: 10.1109/dcc50243.2021.00028
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Oliva, Marco; Rossi, Massimiliano; Siren, Jouni; Manzini, Giovanni; Kahveci, Tamer; Gagie, Travis; Boucher, Christina
- Corresponding author: Boucher, Christina
On Representing the Degree Sequences of Sublogarithmic-Degree Wheeler Graphs
- DOI: 10.1007/978-3-031-20643-6_18
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: T. Gagie
- Corresponding author: T. Gagie
Compressing and Indexing Aligned Readsets
- DOI: 10.4230/lipics.wabi.2021.13
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Gagie, Travis; Gourdel, Garance; Manzini, Giovanni
- Corresponding author: Manzini, Giovanni
Other publications by Christina Boucher
ONeSAMP 3.0: Effective Population Size via SNP Data for One Population Sample
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Aaron Hong; R. G. Cheek; Kingshuk Mukherjee; Isha Yooseph; Marco Oliva; Mark Heim; W. C. Funk; David Tallmon; Christina Boucher
- Corresponding author: Christina Boucher
Data Structures for SMEM-Finding in the PBWT
- DOI: 10.1007/978-3-031-43980-3_8
- Published: 2023
- Journal:
- Impact factor: 5.4
- Authors: Paola Bonizzoni; Christina Boucher; D. Cozzi; Travis Gagie; Dominik Köppl; Massimiliano Rossi
- Corresponding author: Massimiliano Rossi
A study at the wildlife-livestock interface unveils the potential of feral swine as a reservoir for extended-spectrum β-lactamase-producing Escherichia coli
- DOI: 10.1016/j.jhazmat.2024.134694
- Published: 2024-07-15
- Journal:
- Impact factor: 11.300
- Authors: Ting Liu; Shinyoung Lee; Miju Kim; Peixin Fan; Raoul K. Boughton; Christina Boucher; Kwangcheol C. Jeong
- Corresponding author: Kwangcheol C. Jeong
A comparative study of antibiotic resistance patterns in Mycobacterium tuberculosis
- DOI: 10.1038/s41598-025-89087-w
- Published: 2025-02-11
- Journal:
- Impact factor: 3.900
- Authors: Mohammadali Serajian; Conrad Testagrose; Mattia Prosperi; Christina Boucher
- Corresponding author: Christina Boucher
Solving the Minimal Positional Substring Cover Problem in Sublinear Space
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Paola Bonizzoni; Christina Boucher; D. Cozzi; Travis Gagie; Yuri Pirola
- Corresponding author: Yuri Pirola
Other grants held by Christina Boucher
Collaborative Research: EAGER: Solving the bait learning problem for large-scale DNA enrichment
- Award number: 2118251
- Fiscal year: 2021
- Amount: $700,400
- Project type: Standard Grant
SCH: INT: Enabling real time surveillance of antimicrobial resistance
- Award number: 2013998
- Fiscal year: 2021
- Amount: $700,400
- Project type: Standard Grant
III: Small: Collaborative Research: A Scalable and Efficient Optical Map Assembler
- Award number: 1618814
- Fiscal year: 2016
- Amount: $700,400
- Project type: Standard Grant
Similar overseas grants
REU Site: Program for Access to Training in Health Informatics (PATHI)
- Award number: 2348793
- Fiscal year: 2024
- Amount: $700,400
- Project type: Standard Grant
Travel: IEEE International Conference on Healthcare Informatics (IEEE ICHI 2024) Doctoral Consortium Travel Scholarship
- Award number: 2414093
- Fiscal year: 2024
- Amount: $700,400
- Project type: Standard Grant
Reliable Tensor-Network Fusion Approach to Medical Informatics: Novel Techniques and Benchmarks
- Award number: 24K03005
- Fiscal year: 2024
- Amount: $700,400
- Project type: Grant-in-Aid for Scientific Research (B)
CAREER: Transforming Personal Informatics Systems to Support Routine Transitions in Healthy Eating
- Award number: 2414270
- Fiscal year: 2023
- Amount: $700,400
- Project type: Continuing Grant
Travel: NSF Student Travel Grant for 2023 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI)
- Award number: 2331680
- Fiscal year: 2023
- Amount: $700,400
- Project type: Standard Grant
Development of Informatics Materials with an Awareness of the High School-University connection and a Learning Support Environment for Data-Driven Instruction
- Award number: 23H01019
- Fiscal year: 2023
- Amount: $700,400
- Project type: Grant-in-Aid for Scientific Research (B)
Categorical Duality and Semantics Across Mathematics, Informatics and Physics and their Applications to Categorical Machine Learning and Quantum Computing
- Award number: 23K13008
- Fiscal year: 2023
- Amount: $700,400
- Project type: Grant-in-Aid for Early-Career Scientists
Pioneering Research of industrial materials informatics for innovative lithium battery anodes
- Award number: 23K18465
- Fiscal year: 2023
- Amount: $700,400
- Project type: Grant-in-Aid for Challenging Research (Exploratory)
ACTS (AD Clinical Trial Simulation): Developing Advanced Informatics Approaches for an Alzheimer's Disease Clinical Trial Simulation System
- Award number: 10753675
- Fiscal year: 2023
- Amount: $700,400
- Project type:
Establishment of polymer informatics incorporating polymer-specific hierarchy and search for new electrolytes
- Award number: 23H02027
- Fiscal year: 2023
- Amount: $700,400
- Project type: Grant-in-Aid for Scientific Research (B)