
K-mer indexing for pan-genome reference annotation


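Basic information

Grant number:
10793082
Principal investigator:
Hanlee P Ji
Amount:
$300,000
Institution:
STANFORD UNIVERSITY
Institution country:
United States
Project category:
Fiscal year:
2023
Funding country:
United States
Project period:
2023-02-22 to 2024-01-31
Project status:
Completed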

Source:
https://reporter.nih.gov/project-details/10793082
Keywords:
Acceleration Address Algorithms Architecture BRCA mutations Biological Biomedical Research Bite Chromosomes ClinVar Clinical Clinical assessments Cloud Computing Code Collection Communities Complex Data Data Set Databases Development Diploidy Disease Elements Foundations Frequencies Gene Frequency Genes Genetic Annotation Genetic Code Genetic Polymorphism Genetic Variation Genome Genomics Goals Haplotypes Human Human Biology Human Genetics Human Genome Individual Infrastructure Intuition Length Link Location Maps Memory Metadata Methods Nature Nucleotides Oncogenes Performance Persons Phase Population Privacy Process Research Research Personnel Resolution Sampling Savings Scheme Sequence Analysis Speed System Update Variant Work clinical application clinically relevant cloud based community engagement cost data sharing design flexibility foot genetic variant genome sciences genome sequencing genomic data human disease human reference genome improved indexing next generation next generation sequencing novel pan-genome population based preservation reference genome web portal

Project abstract

ABSTRACT The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is more efficient computationally, provides accurate representation in the context of populations and facilitates the analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference. First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes, by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for directly analyzing compressed genomic data. 
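The first aim (index every K-mer of a reference with metadata, then add only the novel K-mers from newly sequenced genomes) can be illustrated with a minimal sketch. It assumes a plain in-memory Python dictionary and toy sequences; the abstract does not describe the project's actual data structures, and the names `KmerIndex`, `add_sequence`, `reference`, and `sample_1` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, kmer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class KmerIndex:
    """Toy K-mer index: each K-mer maps to a list of metadata records."""

    def __init__(self, k):
        self.k = k
        # kmer -> list of (source name, 0-based offset) records
        self.table = defaultdict(list)

    def add_sequence(self, name, seq):
        """Index seq and return the set of K-mers not seen before."""
        novel = set()
        for offset, kmer in kmers(seq, self.k):
            if kmer not in self.table:
                novel.add(kmer)
            self.table[kmer].append((name, offset))
        return novel


# Toy data (not real genomic sequence): the sample differs from the
# reference by a single substitution at 0-based offset 4 (A -> T).
reference = "ACGTACGGA"
sample = "ACGTTCGGA"

index = KmerIndex(k=4)
index.add_sequence("reference", reference)
novel = index.add_sequence("sample_1", sample)  # incremental update

print(sorted(novel))        # the K-mers that overlap the substitution
print(index.table["ACGT"])  # a shared K-mer carries records from both sources
```

In this sketch a single substitution introduces at most k new K-mers, so adding a genome that is close to the reference grows the table only slightly. That is one intuition for why an incremental index can stay compact; it is not a claim about the project's measured behavior.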
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as of clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to allow annotating genetic variation to a particular genome reference. Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement and to enable contributions from the research community. We expect that completion of these aims will provide: a scalable computational architecture which incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing. This work will help solve the issues of multiple assemblies. It will improve researchers’ ability to understand the relationship of variants and disease, while also providing great savings over the long term in infrastructure and computing costs.
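The second aim (rapid variant queries, with conventional coordinates kept as K-mer metadata) can be sketched in the same hedged way. This example handles single-base substitutions only and uses a Python dictionary, whose lookups take constant time on average. The function names, the reference name `toy_ref`, and the label `toy_variant_1` are hypothetical and do not come from the project or from ClinVar.

```python
def build_variant_index(ref_name, ref_seq, variants, k):
    """Map each variant-specific K-mer to coordinate metadata.

    variants: iterable of (0-based pos, ref_base, alt_base, label).
    Only K-mers absent from the reference are kept, so a hit implies
    that the read carries the alternate allele.
    """
    ref_kmers = {ref_seq[i:i + k] for i in range(len(ref_seq) - k + 1)}
    index = {}
    for pos, ref_base, alt_base, label in variants:
        if ref_seq[pos] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
        first = max(0, pos - k + 1)        # first window covering pos
        last = min(len(alt_seq) - k, pos)  # last window covering pos
        for start in range(first, last + 1):
            kmer = alt_seq[start:start + k]
            if kmer not in ref_kmers:
                record = (ref_name, pos, ref_base, alt_base, label)
                index.setdefault(kmer, set()).add(record)
    return index


def query(index, read, k):
    """Return the metadata of every indexed variant supported by the read."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits |= index.get(read[i:i + k], set())
    return hits


# Toy data (not real genomic sequence or real variant identifiers).
ref_seq = "ACGTACGGA"
variants = [(4, "A", "T", "toy_variant_1")]
variant_index = build_variant_index("toy_ref", ref_seq, variants, k=4)

carrier_hits = query(variant_index, "GTTCGG", k=4)      # read with the T allele
non_carrier_hits = query(variant_index, "GTACGG", k=4)  # read matching the reference
print(carrier_hits)
print(non_carrier_hits)
```

Each lookup depends on the read length and on k rather than on how many variants are stored, which is one way query time could stay roughly flat as a database grows. Because every record keeps its reference name and position, a K-mer hit can be reported in conventional coordinates.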
