Phylogenetic Binning of Metagenomic Sequence Data

Basic Information

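Grant number:
7708544
Principal investigator:
Eric Ellsworth Allen
Amount:
$187,100
Host institution:
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Host institution country:
United States
Project category:
Fiscal year:
2009
Funding country:
United States
Project period:
2009-08-24 to 2011-07-31
Project status:
Completed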

Project Abstract

DESCRIPTION (provided by applicant): Culture-independent metagenomic studies are essential for understanding our relationship with the organisms comprising the human microbiome, defining optimal microbial composition to maintain health, and devising selective treatment strategies to eliminate pathogens without harming beneficial species. To use metagenomic data effectively, raw DNA sequence data (reads) must be processed computationally (assembled) to obtain longer sequences (contigs). Existing software packages for this purpose are quite inefficient when presented with large, taxonomically diverse samples, resulting in considerable wastage of reads that cannot be assembled. Efforts to maximize assembly efficiency by relaxing stringency can lead to inappropriate joining of sequences from unrelated organisms (chimeric artifacts), compromising data accuracy and usefulness. Taxonomic binning of raw reads as a pre-filtering step is expected to improve metagenomic sequence assembly efficiency, reducing statistical noise due to sample complexity and allowing incorporation of raw reads into longer, more informative contigs without incurring chimeric artifacts. Benefits should be especially significant for less abundant species in complex mixtures. We have developed methods to quantify taxonomic binning program performance and assembly improvements in real metagenomic data sets, including reproducible calibration standards, to enable efficient parameter optimization for existing software and provide reliable benchmarks for future software development. 
Our specific aims are to 1) develop new computational methods for large-scale taxonomic classification of metagenomic sequence data, applicable to raw reads as well as assembled contigs; 2) develop software and protocols to use taxonomic data binning as a pre-treatment to increase efficiency of existing sequence assembly software; 3) benchmark performance enhancement for different assembly software programs using quantitative, statistical tests with both artificially created models and real-life metagenomic data sets of varying size and complexity; 4) make new computational methods and performance evaluation tools available to the general scientific community.
