K-mer indexing for pan-genome reference annotation
Basic information
- Award number: 10793082
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-02-22 to 2024-01-31
- Project status: Completed
- Source:
- Keywords: Acceleration; Address; Algorithms; Architecture; BRCA mutations; Biological; Biomedical Research; Bite; Chromosomes; ClinVar; Clinical; Clinical assessments; Cloud Computing; Code; Collection; Communities; Complex; Data; Data Set; Databases; Development; Diploidy; Disease; Elements; Foundations; Frequencies; Gene Frequency; Genes; Genetic Annotation; Genetic Code; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Goals; Haplotypes; Human; Human Biology; Human Genetics; Human Genome; Individual; Infrastructure; Intuition; Length; Link; Location; Maps; Memory; Metadata; Methods; Nature; Nucleotides; Oncogenes; Performance; Persons; Phase; Population; Privacy; Process; Research; Research Personnel; Resolution; Sampling; Savings; Scheme; Sequence Analysis; Speed; System; Update; Variant; Work; clinical application; clinically relevant; cloud based; community engagement; cost; data sharing; design; flexibility; foot; genetic variant; genome sciences; genome sequencing; genomic data; human disease; human reference genome; improved; indexing; next generation; next generation sequencing; novel; pan-genome; population based; preservation; reference genome; web portal
Project abstract
The human genome reference sequence is one of the foundations of genome sciences, especially in the context
of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research
and been particularly instrumental in human disease gene identification. However, the human genome reference
is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual
flexibility to represent the breadth of human variation. Important elements of individual genomes are either
missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with
population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is
more efficient computationally, provides accurate representation in the context of populations and facilitates the
analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational
architecture that will encode and annotate large collections of genomes in the context of a pan-genome
reference.
First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased
reference genomes, by 1) generating an index of all K-mers in the human reference genome GRCh38 in a manner
that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to
include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for
directly analyzing compressed genomic data.
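
To make the first aim concrete, here is a minimal sketch of a K-mer index that keeps occurrence metadata and can be updated incrementally. It is an illustration only, not the project's implementation: the function names (`build_index`, `add_genome`), the toy sequences, the source labels, and the choice of K = 4 are all assumptions made for this example, and a real index over GRCh38 would need a far more compact data structure than a Python dictionary.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(name, seq, k):
    """Map each k-mer to metadata: a list of (source name, offset) occurrences."""
    index = defaultdict(list)
    for pos, kmer in kmers(seq, k):
        index[kmer].append((name, pos))
    return index


def add_genome(index, name, seq, k):
    """Add another sequence to an existing index; return the novel k-mers in order."""
    novel = []
    for pos, kmer in kmers(seq, k):
        if kmer not in index:  # membership test does not create an entry
            novel.append(kmer)
        index[kmer].append((name, pos))
    return novel


K = 4                      # toy value; hypothetical, chosen for readability
reference = "ACGTACGGA"    # stand-in for a reference sequence
sample = "ACGTTCGGA"       # same sequence with one substitution at offset 4

index = build_index("ref", reference, K)
novel = add_genome(index, "sample1", sample, K)
print(novel)               # k-mers introduced by the substitution
print(index["ACGT"])       # shared k-mer now carries metadata from both sources
```

A single substitution introduces at most K novel K-mers, so in this toy setting the index grows with the amount of new variation rather than with the number of genomes added.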
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known
human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2)
developing functions for our pan-genomic index that support ultra-rapid queries, such as for clinically important
variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to
allow annotating genetic variation to a particular genome reference.
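
The idea of linking coordinates to K-mer metadata and querying for known variants can be sketched as follows. This is a hedged toy example, not the method described in the grant: `variant_kmers`, `build_variant_index`, `query`, the contig name `chrDemo`, and the sequences are hypothetical, and the sketch has only been exercised on a single-base substitution.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def variant_kmers(reference, pos, ref_allele, alt_allele, k):
    """K-mers that span the alternate allele and do not occur in the reference."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    mutated = reference[:pos] + alt_allele + reference[pos + len(ref_allele):]
    start = max(0, pos - k + 1)
    end = min(len(mutated), pos + len(alt_allele) + k - 1)
    return kmer_set(mutated[start:end], k) - kmer_set(reference, k)


def build_variant_index(reference, variants, k):
    """Map each variant-specific k-mer to the coordinate records it supports."""
    index = {}
    for var in variants:
        _chrom, pos, ref_allele, alt_allele = var
        for kmer in variant_kmers(reference, pos, ref_allele, alt_allele, k):
            index.setdefault(kmer, []).append(var)
    return index


def query(index, read, k):
    """Return the variants whose k-mers appear in a read (one dict lookup per k-mer)."""
    hits = []
    for kmer in sorted(kmer_set(read, k)):
        for var in index.get(kmer, []):
            if var not in hits:
                hits.append(var)
    return hits


K = 4
reference = "ACGTACGGA"
# (contig, 0-based position, reference allele, alternate allele); all hypothetical
variants = [("chrDemo", 4, "A", "T")]
variant_index = build_variant_index(reference, variants, K)

hits_variant_read = query(variant_index, "GTTCGG", K)      # read carrying the variant
hits_reference_read = query(variant_index, "ACGTACGG", K)  # read matching the reference
print(hits_variant_read, hits_reference_read)
```

Because each K-mer is resolved with a hash lookup, the per-read query cost in this sketch depends on read length rather than on the number of indexed variants, which is the intuition behind query speeds staying nearly constant as the database grows.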
Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility
of our approach, to promote community engagement and to enable contributions from the research community.
We expect that completion of these aims will provide: a scalable computational architecture which incorporates
the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will
remain nearly constant as the database grows; and a universally accessible portal using cloud computing.
This work will help solve the issues of multiple assemblies. It will improve researchers’ ability to understand
the relationship between variants and disease, while also providing great savings over the long term in infrastructure
and computing costs.
Project outputs
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations.
- DOI: 10.1093/narcan/zcaa034
- Published: 2020-12
- Journal:
- Impact factor: 5.1
- Authors: Lee H; Shuaibi A; Bell JM; Pavlichin DS; Ji HP
- Corresponding author: Ji HP
Other publications by Hanlee P Ji
Improving bioinformatic pipelines for exome variant calling
- DOI: 10.1186/gm306
- Published: 2012-01-01
- Journal:
- Impact factor: 11.200
- Authors: Hanlee P Ji
- Corresponding author: Hanlee P Ji
Other grants held by Hanlee P Ji
Integrating cancer genomics and spatial architecture of tumor infiltrating lymphocytes
- Award number: 10637960
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Project 1 - Molecular and Cellular Determinants of High Risk Gastric Precancerous Lesions
- Award number: 10715762
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Multimodal iterative sequencing of cancer genomes and single tumor cells
- Award number: 10363694
- Fiscal year: 2021
- Funding amount: $300,000
- Project category:
Multimodal iterative sequencing of cancer genomes and single tumor cells
- Award number: 10112576
- Fiscal year: 2021
- Funding amount: $300,000
- Project category:
Multimodal iterative sequencing of cancer genomes and single tumor cells
- Award number: 10576304
- Fiscal year: 2021
- Funding amount: $300,000
- Project category:
Similar overseas grants
Rational design of rapidly translatable, highly antigenic and novel recombinant immunogens to address deficiencies of current snakebite treatments
- Award number: MR/S03398X/2
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Fellowship
CAREER: FEAST (Food Ecosystems And circularity for Sustainable Transformation) framework to address Hidden Hunger
- Award number: 2338423
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Continuing Grant
Re-thinking drug nanocrystals as highly loaded vectors to address key unmet therapeutic challenges
- Award number: EP/Y001486/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant
Metrology to address ion suppression in multimodal mass spectrometry imaging with application in oncology
- Award number: MR/X03657X/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Fellowship
CRII: SHF: A Novel Address Translation Architecture for Virtualized Clouds
- Award number: 2348066
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Standard Grant
The Abundance Project: Enhancing Cultural & Green Inclusion in Social Prescribing in Southwest London to Address Ethnic Inequalities in Mental Health
- Award number: AH/Z505481/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant
ERAMET - Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Award number: 10107647
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: EU-Funded
BIORETS: Convergence Research Experiences for Teachers in Synthetic and Systems Biology to Address Challenges in Food, Health, Energy, and Environment
- Award number: 2341402
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Standard Grant
Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Award number: 10106221
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: EU-Funded
Recite: Building Research by Communities to Address Inequities through Expression
- Award number: AH/Z505341/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant