Scalable post-assembly editing software for finishing and annotating personal genomes
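Basic information
- Grant number: 9767335
- Principal investigator:
- Amount: $750,000
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2018
- Funding country: United States
- Project period: 2018-09-01 to 2021-02-28
- Project status: Completed
- Source: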
- Keywords: Address; Algorithms; Alleles; Automated Annotation; Awareness; Bacterial Genome; Base Sequence; Biological Markers; Cataloging; Catalogs; Chromosomes, Human, Pair 12; Complement; Complex; Computer software; Computers; Consensus Sequence; DNA Resequencing; DNA sequencing; Data; Diagnosis; Diploidy; Disease susceptibility; Foundations; Generations; Genes; Genetic; Genetic Variation; Genome; Genomics; Glean; Goals; Haplotypes; Hour; Human Genome; Imagery; Individual; Manuals; Maps; Performance; Persons; Phase; Phenotype; Polishes; Population; Proteins; Recording of previous events; Resources; Running; Technology; Variant; Writing; base; causal variant; cohort; contig; cost; design; experience; file format; genome annotation; graphical user interface; human disease; improved; knowledge base; next generation sequencing; open source; personalized medicine; programs; prototype; reference genome; scaffold; success; tool; whole genome
Project abstract
We are entering a new era of personal genomics in which an individual's genome sequence will be used to
identify disease susceptibility, improve diagnosis and better treat illnesses, and will be combined across
cohorts and populations to identify new biomarkers and causal mutations underlying any phenotype. Despite
the tremendous success of mapping short-read next-generation sequencing (NGS) data onto a reference
genome (resequencing) in identifying genetic variation in a new genome, the inherent lack of long-range
connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes
exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming by
enabling direct de novo assembly of an individual's genome. However, initial de novo assemblies typically
consist of many thousands of unordered contigs that require extensive post-assembly processing to produce
finished sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent
need for integrated, scalable post-assembly software that 1) automatically organizes, joins and phases the
initial contigs into complete haplotype sequences, 2) supports optional NGS and/or manual polishing and
3) provides initial automated annotation of those sequences. Currently, such software does not exist, and
users must instead cobble together a confusing array of difficult-to-use, task-specific open-source programs.
DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history in finishing bacterial-sized
genomes, although it currently lacks the scalability and some of the functionality needed to tackle
human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable
version of SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes while
also providing a manual editing platform when needed. During Phase I, we will develop two key prototypes: 1) a
new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our
SQD files, and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort
Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by
1) refining the eBAM format for optimal editing performance; 2) building a new 64-bit version of the SMP
editing engine that incorporates the additional functionality necessary for post-assembly finishing of large
eukaryotic genomes, including automated DSA-based scaffolding and phase-aware gap filling, contig joining and
haplotype refinement; 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to
an annotated reference genome; and 4) adding a new feature transfer and analysis module. Together, the aligner
and the module will permit initial annotation of the finished genome along with a cataloging of variants and
their impact in both native and reference coordinates. Including the reference coordinates allows variants in
the new genome to be easily associated with the wealth of information available through the numerous online
knowledge base resources.
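To illustrate what reporting a variant in both native and reference coordinates involves, the following is a minimal Python sketch. It is not DNASTAR's DSA aligner or feature transfer module, whose internals the abstract does not describe. It assumes a hypothetical alignment represented as sorted, non-overlapping, gapless blocks on the forward strand with 0-based positions; the names `Block` and `native_to_ref` are invented for this example.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One gapless aligned segment (hypothetical structure, 0-based, forward strand)."""
    native_start: int  # start position in the newly finished sequence
    ref_start: int     # start position in the reference genome
    length: int        # number of aligned bases in this segment


def native_to_ref(blocks: Sequence[Block], pos: int) -> Optional[int]:
    """Translate a native position into a reference position.

    `blocks` must be sorted by native_start and must not overlap.
    Returns None when the position has no aligned reference base,
    for example inside a sequence inserted relative to the reference.
    """
    starts = [b.native_start for b in blocks]
    # Index of the last block that starts at or before `pos`.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None  # position lies before the first aligned block
    block = blocks[i]
    offset = pos - block.native_start
    if offset >= block.length:
        return None  # position lies in an unaligned gap after this block
    return block.ref_start + offset


# Toy alignment: a 5-base insertion in the native sequence (positions 50-54),
# then a 10-base deletion relative to the reference (reference 1095-1104).
blocks = [
    Block(native_start=0, ref_start=1000, length=50),
    Block(native_start=55, ref_start=1050, length=45),
    Block(native_start=100, ref_start=1105, length=20),
]

variant_native_pos = 60
variant_ref_pos = native_to_ref(blocks, variant_native_pos)
print(variant_native_pos, variant_ref_pos)
```

A variant that maps to a reference position can be looked up in reference-keyed resources, while one that returns `None` can only be described in native coordinates. A production tool would also need to handle reverse-strand alignments and multiple chromosomes, which this sketch omits.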
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by TIMOTHY J DURFEE
Long read based sequencing software for the comprehensive analysis of clinical samples
- Grant number: 10009727
- Fiscal year: 2020
- Amount: $750,000
- Project category:
Scalable post-assembly editing software for finishing and annotating personal genomes
- Grant number: 9883809
- Fiscal year: 2018
- Amount: $750,000
- Project category:
Complete genome de novo assembly software for the emerging long read sequencing era
- Grant number: 9255092
- Fiscal year: 2017
- Amount: $750,000
- Project category:
Complete genome de novo assembly software for the emerging long read sequencing era
- Grant number: 9747613
- Fiscal year: 2017
- Amount: $750,000
- Project category:
Association Analysis Software for Mining Clinical Next-Gen Sequencing Data
- Grant number: 8236680
- Fiscal year: 2012
- Amount: $750,000
- Project category:
Association Analysis Software for Mining Clinical Next-Gen Sequencing Data
- Grant number: 8727829
- Fiscal year: 2012
- Amount: $750,000
- Project category:
Association Analysis Software for Mining Clinical Next-Gen Sequencing Data
- Grant number: 8703156
- Fiscal year: 2012
- Amount: $750,000
- Project category:
Association Analysis Software for Mining Clinical Next-Gen Sequencing Data
- Grant number: 8624982
- Fiscal year: 2012
- Amount: $750,000
- Project category:
A Desktop Assembly and Analysis Pipeline for Next-gen Metagenomic Sequencing
- Grant number: 8200467
- Fiscal year: 2011
- Amount: $750,000
- Project category:
Integrated Assembly Software for Sanger and Next Generation Sequence Technologies
- Grant number: 8011298
- Fiscal year: 2007
- Amount: $750,000
- Project category:
Similar overseas grants
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
- Grant number: EP/Y029089/1
- Fiscal year: 2024
- Amount: $750,000
- Project category: Research Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
- Grant number: 2337776
- Fiscal year: 2024
- Amount: $750,000
- Project category: Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
- Grant number: 2338816
- Fiscal year: 2024
- Amount: $750,000
- Project category: Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
- Grant number: 2338846
- Fiscal year: 2024
- Amount: $750,000
- Project category: Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
- Grant number: 2348261
- Fiscal year: 2024
- Amount: $750,000
- Project category: Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
- Grant number: 2348346
- Fiscal year: 2024
- Amount: $750,000
- Project category: Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
- Grant number: 2348457
- Fiscal year: 2024
- Amount: $750,000
- Project category: Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
- Grant number: 2404989
- Fiscal year: 2024
- Amount: $750,000
- Project category: Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
- Grant number: 2339310
- Fiscal year: 2024
- Amount: $750,000
- Project category: Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
- Grant number: 2339669
- Fiscal year: 2024
- Amount: $750,000
- Project category: Continuing Grant