Scalable post-assembly editing software for finishing and annotating personal genomes

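Basic information

Grant number:
9767335
Principal investigator:
TIMOTHY J DURFEE
Amount:
$750,000
Host institution:
DNASTAR, INC.
Institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-09-01 to 2021-02-28
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9767335
Keywords: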
Address; Algorithms; Alleles; Automated Annotation; Awareness; Bacterial Genome; Base Sequence; Biological Markers; Cataloging; Catalogs; Chromosomes, Human, Pair 12; Complement; Complex; Computer software; Computers; Consensus Sequence; DNA Resequencing; DNA sequencing; Data; Diagnosis; Diploidy; Disease susceptibility; Foundations; Generations; Genes; Genetic; Genetic Variation; Genome; Genomics; Glean; Goals; Haplotypes; Hour; Human Genome; Imagery; Individual; Manuals; Maps; Performance; Persons; Phase; Phenotype; Polishes; Population; Proteins; Recording of previous events; Resources; Running; Technology; Variant; Writing; base; causal variant; cohort; contig; cost; design; experience; file format; genome annotation; graphical user interface; human disease; improved; knowledge base; next generation sequencing; open source; personalized medicine; programs; prototype; reference genome; scaffold; success; tool; whole genome

Project abstract

We are entering a new era of personal genomics in which an individual's genome sequence will be used to identify disease susceptibility, improve diagnosis, and better treat illnesses, and will also be combined across cohorts and populations to identify new biomarkers and causal mutations underlying any phenotype. Despite the tremendous success of mapping short-read next-generation sequencing (NGS) data onto a reference genome (resequencing) in identifying genetic variation in a new genome, the inherent lack of long-range connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming by direct de novo assembly of an individual's genome. However, initial de novo assemblies typically consist of many thousands of unordered contigs that require extensive post-assembly processing to produce finished sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent need for integrated, scalable post-assembly software that 1) automatically organizes, joins, and phases the initial contigs into complete haplotype sequences, 2) supports optional NGS and/or manual polishing, and 3) provides initial automated annotation of those sequences. Currently, such software does not exist, and users must instead cobble together a confusing array of difficult-to-use, task-specific open-source programs. DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history in finishing bacterial-sized genomes, although it currently lacks the scalability and some of the functionality needed to tackle human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable version of SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes while also providing a manual editing platform when needed.
During Phase I, we will develop two key prototypes: 1) a new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our SQD files, and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by: 1) refining the eBAM format for optimal editing performance, 2) building a new 64-bit version of the SMP editing engine that incorporates the additional functionality necessary for post-assembly finishing of large eukaryotic genomes, including automated DSA-based scaffolding and phase-aware gap filling, contig joining, and haplotype refinement, 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to an annotated reference genome, which, together with 4) a new feature transfer and analysis module, will permit initial annotation of the finished genome along with a catalog of variants and their impact in both native and reference coordinates. Including the reference coordinates allows variants in the new genome to be easily associated with the wealth of information available through the numerous online knowledge base resources.
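
The idea of reporting each variant in both native and reference coordinates can be illustrated with a small sketch. This is not DNASTAR's DSA aligner or feature transfer module, whose internals the abstract does not describe. It assumes a genome-to-reference alignment has already been reduced to gap-free, same-strand blocks; the `Block` type, the `native_to_ref` function, and all coordinates are hypothetical names and values chosen for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """One gap-free aligned segment (hypothetical structure, 0-based coordinates)."""
    native_start: int  # start position in the newly finished genome
    ref_start: int     # start position in the reference genome
    length: int        # number of aligned bases


def native_to_ref(blocks: list[Block], native_pos: int) -> Optional[int]:
    """Map a native position to a reference position, or None if unaligned.

    Assumes blocks are sorted by native_start, do not overlap,
    and all lie on the same strand.
    """
    starts = [b.native_start for b in blocks]
    i = bisect_right(starts, native_pos) - 1  # last block starting at or before native_pos
    if i < 0:
        return None  # position lies before the first aligned block
    block = blocks[i]
    offset = native_pos - block.native_start
    if offset >= block.length:
        return None  # position falls between blocks, e.g. inside an insertion
    return block.ref_start + offset


# Toy alignment: native bases 50..59 are an insertion with no reference counterpart.
blocks = [Block(0, 1000, 50), Block(60, 1050, 40)]
variant_positions = [10, 55, 75]

# Catalog each variant as (native coordinate, reference coordinate or None).
catalog = [(pos, native_to_ref(blocks, pos)) for pos in variant_positions]
print(catalog)  # [(10, 1010), (55, None), (75, 1065)]
```

A variant with a reference coordinate can be looked up directly in reference-based resources, while one that maps to `None` exists only in the individual's genome and has to be reported in native coordinates alone.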