Position Sensitive P-Mer Frequency Clustering with Applications to Classification

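Basic information

  • Grant number:
    8192895
  • Principal investigator:
  • Amount:
    $180,700
  • Host institution:
  • Host institution country:
    United States
  • Project type:
  • Fiscal year:
    2011
  • Funding country:
    United States
  • Project period:
    2011-08-16 to 2013-05-31
  • Project status:
    Completed

Project abstract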

DESCRIPTION (provided by applicant): Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation

Recent genomic sequencing advances, such as next-generation sequencing, and projects like the Human Microbiome Project create extremely large genomic databases. Even though the length of any specific sequence may be much shorter than that of the complete DNA sequence of an organism, looking at enormous libraries of sequences, such as 16S rRNA, presents an equally large (if not greater) computational challenge. In traditional genomic analysis, only one sequence may be analyzed at a time. When dealing with metagenomics, thousands (or more) of sequences need to be analyzed at the same time. To study problems such as environmental biological diversity and human microbiome diversity, this is exactly what is needed.

Current techniques have several shortcomings that need to be addressed. Techniques involving sequence alignment are typically based on the selection of one representative sequence (as is typically done when looking at 16S rRNA data), which introduces selection bias. Genomic databases involving multiple copies of 16S per organism across thousands of organisms will soon grow too large to process practically using only computationally expensive alignment methods to match sequences, but faster alignment-free methods currently do not provide the needed accuracy and sensitivity.

As a complement to existing methods, we introduce a novel class of fast high-throughput algorithms based on quasi-alignment using position-specific p-mer frequency clustering. Organisms are represented by a directed graph structure that summarizes the ordering between clusters of p-mer frequency histograms at different positions in sequences. This model can be learned using all available 16S copies of an organism and thus eliminates selection bias. Due to the added position information, these algorithms can be used for species (and even strain) classification, facilitating the study of strain diversity within species.

Our prototype implementation of this new technique shows that it is able to produce compact profiles that can be efficiently stored and used for large-scale classification and differentiation down to the strain level. Since the technique incorporates high-throughput data stream clustering, a proven technique in high-performance computing, it scales well to very large DNA/RNA sequence data as well as to massive sets of short sequence snippets collected during metagenomic research.

In this project we will develop a suite of tools, profile models, and scoring techniques to model RNA/DNA sequences, with applications to organism classification and to intra- and inter-organism similarity and diversity analysis. Our approach provides the specificity needed to perform strain classification while still avoiding the computational overhead of alignment. It is important to note that this is accomplished through dynamic online machine learning techniques without human intervention.

PUBLIC HEALTH RELEVANCE: Recent advances in metagenomics and the Human Microbiome provide a complex landscape for dealing with a multitude of genomes all at once. One of the many challenges in this field is classification of the genomes present in a sample. Effective metagenomic classification and diversity analysis require complex representations of taxa. The significance of our research is that we develop a suite of tools, based on novel alignment-free techniques, that will be applied to environmental metagenomic samples as well as human microbiome samples. Providing methods that rapidly classify organisms using our new approach on a laptop computer, instead of on several multi-processor servers, will facilitate the rapid development of microbiome-based health screening in the near future.
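The abstract describes the profile model only in outline: p-mer frequency histograms are computed at different positions, clustered, and linked in a directed graph that records their order. The sketch below is one possible reading of that outline, not the applicants' implementation. The fixed non-overlapping windows, the Manhattan distance, the threshold-based online ("leader") clustering, the transition-based score, and every name and parameter (`Profile`, `window`, `threshold`, and so on) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product


def pmer_histogram(segment, p=3):
    """Relative frequency of every p-mer over the alphabet ACGT in one segment."""
    pmers = ["".join(t) for t in product("ACGT", repeat=p)]
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    total = sum(counts[m] for m in pmers) or 1  # avoid division by zero
    return [counts[m] / total for m in pmers]


def manhattan(a, b):
    """Manhattan (L1) distance between two histograms (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class Profile:
    """Hypothetical organism profile: clusters of histograms plus a directed graph."""

    def __init__(self, p=3, window=100, threshold=0.5):
        self.p, self.window, self.threshold = p, window, threshold
        self.centers = []              # one representative histogram per cluster
        self.edges = defaultdict(int)  # (from_cluster, to_cluster) -> count

    def _segments(self, seq):
        # Non-overlapping windows stand in for "positions" in the sequence.
        w = self.window
        return [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]

    def _assign(self, hist, learn):
        # Nearest existing cluster within the threshold, else a new cluster
        # (when learning) or None (when scoring).
        best, best_d = None, float("inf")
        for i, center in enumerate(self.centers):
            d = manhattan(hist, center)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.threshold:
            return best
        if learn:
            self.centers.append(hist)
            return len(self.centers) - 1
        return None

    def learn(self, seq):
        """Update clusters and graph edges from one sequence (one 16S copy, say)."""
        prev = None
        for seg in self._segments(seq):
            cur = self._assign(pmer_histogram(seg, self.p), learn=True)
            if prev is not None:
                self.edges[(prev, cur)] += 1
            prev = cur

    def score(self, seq):
        """Fraction of consecutive-window transitions already present in the graph."""
        clusters = [self._assign(pmer_histogram(s, self.p), learn=False)
                    for s in self._segments(seq)]
        pairs = list(zip(clusters, clusters[1:]))
        hits = sum(1 for pair in pairs if pair in self.edges)
        return hits / len(pairs) if pairs else 0.0


profile = Profile(p=2, window=8, threshold=0.5)
profile.learn("A" * 8 + "C" * 8 + "G" * 8)
print(profile.score("A" * 8 + "C" * 8 + "G" * 8))  # 1.0
print(profile.score("G" * 8 + "C" * 8 + "A" * 8))  # 0.0
```

The two toy sequences have identical overall p-mer composition and differ only in the order of their windows, so a position-free frequency comparison could not separate them, while the transition graph can. Calling `learn` once per available copy of an organism's sequence would fold all copies into a single profile, which is the property the abstract credits with removing selection bias.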

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by MARGARET Holder DUNHAM

Position Sensitive P-Mer Frequency Clustering with Applications to Classification
  • Grant number:
    8320160
  • Fiscal year:
    2011
  • Funding amount:
    $180,700
  • Project type:

Similar overseas grants

DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
  • Grant number:
    EP/Y029089/1
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Research Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
  • Grant number:
    2337776
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
  • Grant number:
    2338816
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
  • Grant number:
    2338846
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
  • Grant number:
    2348261
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
  • Grant number:
    2348346
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
  • Grant number:
    2348457
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
  • Grant number:
    2404989
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
  • Grant number:
    2339310
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
  • Grant number:
    2339669
  • Fiscal year:
    2024
  • Funding amount:
    $180,700
  • Project type:
    Continuing Grant