Position Sensitive P-Mer Frequency Clustering with Applications to Classification

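Basic Information

Grant number:
8320160
Principal investigator:
MARGARET Holder DUNHAM
Amount:
$205,000
Host institution:
SOUTHERN METHODIST UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2011
Funding country:
United States
Project period:
2011-08-16 to 2014-05-31
Project status:
Completed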

Project Abstract

DESCRIPTION (provided by applicant): Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation

Recent advances in genomic sequencing, such as next generation sequencing, and projects like the Human Microbiome Project are creating extremely large genomic databases. Even though any specific sequence may be much shorter than the complete DNA sequence of an organism, analyzing enormous libraries of sequences, such as 16S rRNA, presents an equal (if not greater) computational challenge. In traditional genomic analysis, only one sequence may be analyzed at a time. In metagenomics, thousands (or more) of sequences need to be analyzed at the same time, and this is exactly what is needed to study problems such as environmental biological diversity and human microbiome diversity.

Current techniques have several shortcomings that need to be addressed. Techniques involving sequence alignment are typically based on the selection of one representative sequence (as is typically done when looking at 16S rRNA data), which introduces selection bias. Genomic databases holding multiple copies of 16S per organism across thousands of organisms will soon grow too large to process practically using only computationally expensive alignment methods to match sequences, while faster alignment-free methods currently do not provide the needed accuracy and sensitivity.

As a complement to existing methods, we introduce a novel class of fast high-throughput algorithms based on quasi-alignment using position specific p-mer frequency clustering. Organisms are represented by a directed graph structure that summarizes the ordering between clusters of p-mer frequency histograms at different positions in sequences. This model can be learned using all available 16S copies of an organism and thus eliminates selection bias.
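The model described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the class name `OrganismModel`, the fixed non-overlapping segments, the Manhattan distance, the running-mean cluster update, and the default values of `p`, `segment_len`, and `threshold` are all assumptions chosen for illustration. Each segment of a sequence is turned into a p-mer frequency histogram, the histogram is assigned to the nearest existing cluster (or starts a new one), and the directed edge between consecutive clusters is counted.

```python
from collections import Counter


def pmer_histogram(segment, p=3):
    """Relative frequency of every overlapping p-mer in one segment."""
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    total = sum(counts.values())
    return {pmer: n / total for pmer, n in counts.items()} if total else {}


def manhattan(a, b):
    """Manhattan distance between two sparse histograms (range 0 to 2)."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


class OrganismModel:
    """Hypothetical model: histogram clusters plus a directed transition graph."""

    def __init__(self, p=3, segment_len=20, threshold=0.8):
        self.p = p
        self.segment_len = segment_len
        self.threshold = threshold   # assumed cut-off for opening a new cluster
        self.centers = []            # one mean histogram per cluster
        self.sizes = []              # number of histograms absorbed per cluster
        self.edges = Counter()       # (from_cluster, to_cluster) -> count

    def _assign(self, hist):
        """Single-pass clustering: join the nearest cluster or open a new one."""
        best, best_d = None, None
        for idx, center in enumerate(self.centers):
            d = manhattan(hist, center)
            if best_d is None or d < best_d:
                best, best_d = idx, d
        if best is None or best_d > self.threshold:
            self.centers.append(dict(hist))
            self.sizes.append(1)
            return len(self.centers) - 1
        # Update the cluster centre as a running mean of its members.
        n, center = self.sizes[best], self.centers[best]
        for k in set(center) | set(hist):
            center[k] = (center.get(k, 0.0) * n + hist.get(k, 0.0)) / (n + 1)
        self.sizes[best] = n + 1
        return best

    def add_sequence(self, seq):
        """Feed one sequence copy; return the cluster path it follows."""
        path, prev = [], None
        last_start = len(seq) - self.segment_len
        for start in range(0, last_start + 1, self.segment_len):
            hist = pmer_histogram(seq[start:start + self.segment_len], self.p)
            cur = self._assign(hist)
            if prev is not None:
                self.edges[(prev, cur)] += 1
            path.append(cur)
            prev = cur
        return path
```

Calling `add_sequence` once per available 16S copy lets every copy contribute to the same clusters and edges, which is the sense in which such a model avoids picking a single representative sequence.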
Due to the added position information, these algorithms can be used for species (and even strain) classification, facilitating the study of strain diversity within species. Our prototype implementation of this new technique shows that it is able to produce compact profiles which can be efficiently stored and used for large scale classification and differentiation down to the strain level. Since the technique incorporates high-throughput data stream clustering, a proven technique in high performance computing, it scales well to very large scale DNA/RNA sequence data as well as to massive sets of short sequence snippets collected during metagenomic research.

In this project we will develop a suite of tools, profile models, and scoring techniques to model RNA/DNA sequences, with applications to organism classification and to intra- and inter-organism similarity and diversity analysis. Our approach provides the specificity needed to perform strain classification while still avoiding the computational overhead of alignment. It is important to note that this is accomplished through dynamic online machine learning techniques without human intervention.
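To make the idea of alignment-free classification against stored profiles concrete, here is a deliberately simplified Python sketch. It is a stand-in rather than the scoring technique the project proposes: instead of clustered graph models, each hypothetical profile keeps one averaged p-mer histogram per segment position, and a query is scored by its mean histogram similarity at matching positions. The function names, the segment length, the value of `p`, and the similarity measure are all assumptions.

```python
from collections import Counter, defaultdict


def pmer_histogram(segment, p=3):
    """Relative frequency of every overlapping p-mer in one segment."""
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    total = sum(counts.values())
    return {pmer: n / total for pmer, n in counts.items()} if total else {}


def segment_histograms(seq, p=3, segment_len=20):
    """One histogram per non-overlapping segment, in sequence order."""
    return [pmer_histogram(seq[s:s + segment_len], p)
            for s in range(0, len(seq) - segment_len + 1, segment_len)]


def build_profile(sequences, p=3, segment_len=20):
    """Average the histograms of all copies, position by position."""
    per_seq = [segment_histograms(s, p, segment_len) for s in sequences]
    n_pos = min(len(h) for h in per_seq)
    profile = []
    for pos in range(n_pos):
        total = defaultdict(float)
        for hists in per_seq:
            for pmer, freq in hists[pos].items():
                total[pmer] += freq
        profile.append({k: v / len(per_seq) for k, v in total.items()})
    return profile


def score(query, profile, p=3, segment_len=20):
    """Mean per-position similarity in [0, 1]; 1 means identical histograms."""
    hists = segment_histograms(query, p, segment_len)
    n = min(len(hists), len(profile))
    if n == 0:
        return 0.0
    sims = []
    for h, c in zip(hists[:n], profile[:n]):
        dist = sum(abs(h.get(k, 0.0) - c.get(k, 0.0)) for k in set(h) | set(c))
        sims.append(1.0 - dist / 2.0)  # Manhattan distance is at most 2
    return sum(sims) / n


def classify(query, profiles, p=3, segment_len=20):
    """Return the name of the best-scoring profile."""
    return max(profiles, key=lambda name: score(query, profiles[name], p, segment_len))


# Toy usage with made-up sequences (not real 16S data).
profiles = {
    "org_a": build_profile(["ACGT" * 10, "ACGT" * 10]),
    "org_b": build_profile(["GGCC" * 10]),
}
best = classify("ACGT" * 10, profiles)
```

Because a profile holds only a handful of small histograms per position, comparing a query against it costs time proportional to the query length, with no pairwise alignment step.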
