BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data

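Basic information

Grant number:
8703739
Principal investigator:
C. Titus BROWN
Amount:
$204,200
Host institution:
MICHIGAN STATE UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2013
Funding country:
United States
Project period:
2013-07-19 to 2015-05-31
Project status:
Completed

Project abstract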

DESCRIPTION (provided by applicant): We will soon be able to exhaustively sequence the DNA and RNA of entire communities of bacteria, as well as every individual cell of a tumor. Both of these very different applications of sequencing share the need to rapidly and efficiently sort through large amounts of noisy sequence data (dozens to hundreds of terabases) to separate signal from noise and produce biological insight. However, current bioinformatics approaches for extracting information from this data cannot easily handle the vast amounts of data being acquired. The primary challenges in processing this sequence data are twofold: the relatively high error rate of 0.1-1% per base, and the volume of data we can now easily acquire with sequencers such as Illumina HiSeq. For years, sequencing capacity has been doubling every 6 months, significantly faster than compute capacity. Since almost all extant bioinformatics analysis approaches require multiple passes across the primary data, and many analysis algorithms have not been parallelized, bioinformatics analysis capacity continues to lag ever further behind data generation capacity. In addition, many of the existing software packages cannot easily be retooled to take advantage of many-core or GPU algorithms, and hence will not take advantage of expected advances in compute capacity and cyberinfrastructure. We propose to develop and implement novel streaming approaches for lossy compression and error correction in shotgun sequencing data. Our algorithms are few-pass (< 2), require no sample-specific information, and can be implemented in fixed or low memory; moreover, they are amenable to parallelization and can run efficiently in many-core environments. When implemented as a prefilter to existing analysis packages, our approaches will eliminate or correct the majority of errors in data sets, dramatically reducing the computational space and time requirements for downstream analysis using existing packages. 
Moreover, we will provide novel capability by extending error correction approaches to mRNAseq and metagenomic data sets. Intellectual Merit: We will develop a range of algorithms for space- and time-efficient compression and error correction of short-read DNA and RNA sequence data. These strategies will substantially increase the scalability of many downstream analysis applications, ranging from community analysis of metagenomes to resequencing analysis of humans. We will provide analyses describing the tradeoffs between space and time efficiency and sensitivity, and deliver tested, documented reference implementations of our approaches that can be used by the community for practical evaluation and incorporation into analysis tools. Our approaches will significantly impact short-read sequence analysis by introducing efficient and effective streaming approaches to the two most common types of short-read analysis, mapping and assembly.
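
The abstract does not spell out the algorithms themselves. As a rough illustration of what a single-pass, fixed-memory streaming prefilter could look like, the Python sketch below counts k-mers in a Count-Min sketch and discards reads whose k-mers have already been seen many times, which is one form of lossy compression. This is an assumed technique chosen for illustration, not the project's reference implementation; the names `CountMinSketch`, `kmers`, and `prefilter`, and the parameters `k`, `cutoff`, `width`, and `depth`, are all hypothetical.

```python
import hashlib
from statistics import median
from typing import Iterable, Iterator, List, Optional


class CountMinSketch:
    """Approximate counter whose memory use is fixed at construction time.

    Collisions can make a count too high, but never too low.
    """

    def __init__(self, width: int = 1 << 20, depth: int = 4) -> None:
        if not 1 <= depth <= 16:
            raise ValueError("depth must be between 1 and 16")
        self.width = width
        self.depth = depth
        # One row of 8-bit saturating counters per hash function.
        self.tables = [bytearray(width) for _ in range(depth)]

    def _slots(self, item: str) -> List[int]:
        # Derive `depth` slot indices from a single BLAKE2b digest (4 bytes each).
        digest = hashlib.blake2b(item.encode(), digest_size=4 * self.depth).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.width
            for i in range(self.depth)
        ]

    def add(self, item: str) -> None:
        for table, slot in zip(self.tables, self._slots(item)):
            if table[slot] < 255:  # saturate instead of overflowing
                table[slot] += 1

    def count(self, item: str) -> int:
        return min(table[slot] for table, slot in zip(self.tables, self._slots(item)))


def kmers(read: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def prefilter(
    reads: Iterable[str],
    k: int = 20,
    cutoff: int = 20,
    sketch: Optional[CountMinSketch] = None,
) -> Iterator[str]:
    """Yield reads in one pass, dropping those that look redundant.

    A read is kept only if the median count of its k-mers is below `cutoff`;
    only kept reads update the sketch.
    """
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            for w in words:
                sketch.add(w)
            yield read
```

Because `prefilter` is a generator, it could sit in front of an existing mapper or assembler and hand reads on as they arrive, for example `for read in prefilter(read_source, k=20, cutoff=20): ...`, where `read_source` is any iterable of sequence strings. Memory is bounded by `width * depth` bytes regardless of how many reads are streamed, at the cost of occasionally overestimated counts.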
