Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly

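Basic Information

Grant number: 9145263
Principal investigator: Sreeram Kannan
Amount: $458,800
Host institution: UNIVERSITY OF CALIFORNIA BERKELEY
Host institution country: United States
Project category:
Fiscal year: 2015
Funding country: United States
Project period: 2015-09-16 to 2018-06-30
Project status: Completed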

Source: https://reporter.nih.gov/project-details/9145263
Keywords:
Address; Algorithmic Software; Algorithms; Alternative Splicing; Animal Model; Automobile Driving; Benchmarking; Biological Assay; Biological Sciences; Complex; Complication; Computer software; Data; Data Set; Detection; Development; Diagnostic; Evaluation; Foundations; Funding; Gene Structure; Generations; Genes; Genome; Government; Health; High-Throughput Nucleotide Sequencing; Human; Individual; Industry; Information Theory; Joints; Lead; Length; Letters; Malignant Neoplasms; Measurement; Memory; Methodology; Methods; Monitor; Organism; Performance; Process; Protein Isoforms; Pythons; RNA; Reading; Sampling; Side; Speed; Structure; System; Technology; Testing; Time; Tissues; Transcript; Work; Writing; base; clinical application; design; heuristics; improved; insight; nanopore; novel strategies; parallelization; personalized medicine; programs; prototype; reconstruction; reference genome; research study; software development; theories; transcriptome; transcriptome sequencing; transcriptomics

Project Abstract

DESCRIPTION (provided by applicant): RNA-Seq has revolutionized transcriptomics and is one of the most important high-throughput sequencing assays invented in recent years. The key computational problem is de novo assembly: the reconstruction of the transcripts and their abundances from tens to hundreds of millions of short reads. The problem is challenging because of a confluence of several factors: a large number of different transcripts (tens of thousands), long repeats across transcripts due to alternative splicing, widely varying abundances across transcripts, and the presence of read errors. Existing assemblers are mostly designed based on heuristic considerations and implement ad hoc methods that lead to unreliable transcriptome reconstructions. An accurate RNA-Seq assembler would enable more accurate identification of fusions in cancer transcriptomes, better gene annotations in model and non-model organisms, and more complete analyses of the dynamics of alternative splicing driving developmental and regulatory programs. In this proposal, we offer a systematic approach to the design of RNA-Seq assemblers based on information-theoretic principles. We start by determining conditions on the data that guarantee that there is enough information to reconstruct the transcriptome, and then propose an assembly algorithm that can reconstruct it with the minimal information. This algorithm optimally uses the available read information to resolve repeats and disambiguate isoforms. A key insight derived from the information-theoretic approach is that widely varying abundances across transcripts, rather than being a complication, can actually be exploited as signatures of different transcripts to disambiguate among them. Based on our initial ideas, we have built and evaluated an initial prototype and compared it with several existing software packages on both real and simulated data.
The encouraging results provide evidence that our approach, which we will fully develop, implement and evaluate during the funded period, can significantly outperform existing software. Additional functionalities, such as mixed short/long read assembly, genome-assisted assembly and joint processing of multiple RNA samples, will be designed and incorporated into the software as part of the proposed project.
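
The idea that differing abundances can act as signatures may be easier to see in a toy example. The Python sketch below is not the project's algorithm, and every name in it (`pair_edges_by_coverage`, the exon labels, the coverage values) is invented for illustration. It assumes a single repeated segment shared by two transcripts, with a known read coverage on each edge entering and leaving the repeat, and pairs incoming with outgoing edges so that the coverages agree as closely as possible.

```python
from itertools import permutations


def pair_edges_by_coverage(in_cov, out_cov):
    """Pair each edge entering a repeat with one edge leaving it.

    in_cov and out_cov map hypothetical edge labels to read coverage.
    The pairing with the smallest total coverage mismatch is returned,
    on the assumption that a transcript keeps roughly the same
    abundance before and after the shared repeat.
    """
    if len(in_cov) != len(out_cov):
        raise ValueError("this toy sketch needs equally many in and out edges")

    in_names = sorted(in_cov)
    out_names = sorted(out_cov)

    best_pairing = None
    best_cost = float("inf")
    # Brute force over all pairings; fine for a handful of edges only.
    for perm in permutations(out_names):
        cost = sum(abs(in_cov[i] - out_cov[o]) for i, o in zip(in_names, perm))
        if cost < best_cost:
            best_cost = cost
            best_pairing = dict(zip(in_names, perm))
    return best_pairing


# Illustrative numbers: one highly expressed and one weakly expressed transcript.
in_cov = {"exon_A": 100.0, "exon_B": 10.0}
out_cov = {"exon_C": 9.0, "exon_D": 102.0}
pairing = pair_edges_by_coverage(in_cov, out_cov)
print(pairing)
```

With these made-up coverages the high-abundance edges are paired together and the low-abundance edges are paired together, which suggests the transcripts A-repeat-D and B-repeat-C. If the two transcripts had nearly equal abundance, both pairings would cost about the same and coverage alone would not resolve the repeat.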
