Position Sensitive P-Mer Frequency Clustering with Applications to Classification
Basic information
- Grant number: 8320160
- Principal investigator:
- Amount: $205,000
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2011
- Funding country: United States
- Project period: 2011-08-16 to 2014-05-31
- Project status: Completed
- Source:
- Keywords: Address; Algorithms; Biodiversity; Classification; Complement; Complex; Computational Technique; Computers; DNA; DNA Sequence; Data; Databases; Development; Effectiveness; Family; Frequencies; Future; Genome; Genomics; Grant; Graph; Habitats; Health; High Performance Computing; Human; Human Microbiome; Intervention; Lead; Learning; Length; Libraries; Link; Machine Learning; Metagenomics; Methods; Mining; Modeling; Online Systems; Organism; Positioning Attribute; Probability; Process; Property; RNA; RNA Sequences; Research; Ribosomal RNA; Sampling; Screening procedure; Selection Bias; Sequence Alignment; Sequence Analysis; Specificity; Stream; Structure; Taxon; Techniques; Testing; Time; Update; Work; base; computing resources; cost; improved; laptop; metagenome; microbial; microbiome; next generation; novel; novel strategies; prototype; research study; statistics; success; tool; user-friendly; web site
Project abstract
DESCRIPTION (provided by applicant):
Position Sensitive P-Mer Frequency Clustering with Applications to Classification and Differentiation

Recent advances in genomic sequencing, such as next generation sequencing, and projects like the Human Microbiome Project create extremely large genomic databases. Even though the length of any specific sequence may be much shorter than that of the complete DNA sequence of an organism, analyzing enormous libraries of sequences, such as 16S rRNA, presents an equal (if not greater) computational challenge. In traditional genomic analysis, only one sequence may be analyzed at a time. In metagenomics, thousands of sequences (or more) need to be analyzed at the same time, and this is exactly what is needed to study problems such as environmental biological diversity and human microbiome diversity.

Current techniques have several shortcomings that need to be addressed. Techniques involving sequence alignment are typically based on the selection of one representative sequence (as is typically done with 16S rRNA data), which introduces selection bias. Genomic databases involving multiple copies of 16S per organism across thousands of organisms will soon grow too large to process practically using only computationally expensive alignment methods to match sequences, but faster alignment-free methods currently do not provide the needed accuracy and sensitivity.

As a complement to existing methods we introduce a novel class of fast high-throughput algorithms based on quasi-alignment using position specific p-mer frequency clustering. Organisms are represented by a directed graph structure that summarizes the ordering between clusters of p-mer frequency histograms at different positions in sequences. This model can be learned using all available 16S copies of an organism and thus eliminates selection bias. Due to the added position information, these algorithms can be used for species (and even strain) classification, facilitating the study of strain diversity within species. Our prototype implementation of this new technique shows that it is able to produce compact profiles which can be efficiently stored and used for large scale classification and differentiation down to the strain level. Since the technique incorporates high-throughput data stream clustering, a proven technique in high performance computing, it scales well for very large scale DNA/RNA sequence data as well as massive sets of short sequence snippets collected during metagenomic research.

In this project we will develop a suite of tools, profile models, and scoring techniques to model RNA/DNA sequences, providing applications of organism classification and intra/inter-organism similarity/diversity. Our approach provides the specificity needed to perform strain classification while still avoiding the computational overhead of alignment. It is important to note that this is accomplished through dynamic online machine learning techniques without human intervention.
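
The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea it describes: compute a p-mer frequency histogram for each consecutive segment of a sequence, cluster the histograms online, and record the order in which clusters follow one another as a directed graph. The class name `QuasiAlignmentProfile`, the Manhattan distance, the fixed distance threshold, the non-overlapping segmentation, and the edge-coverage score are all assumptions made for illustration. They are not taken from the project itself.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def pmer_histogram(segment, p):
    """Normalized frequencies of all 4**p p-mers in one segment."""
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    keys = ["".join(t) for t in product(ALPHABET, repeat=p)]
    total = sum(counts[k] for k in keys) or 1  # p-mers with other letters are ignored
    return [counts[k] / total for k in keys]


def segments(seq, size):
    """Consecutive non-overlapping windows; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


class QuasiAlignmentProfile:
    """Hypothetical profile: histogram clusters plus a directed transition graph."""

    def __init__(self, p=3, segment_size=20, threshold=0.5):
        self.p = p
        self.segment_size = segment_size
        self.threshold = threshold  # assumed maximum distance for joining a cluster
        self.centers = []           # cluster centroids (histograms)
        self.sizes = []             # number of histograms absorbed per cluster
        self.edges = Counter()      # (from_cluster, to_cluster) -> count

    def _assign(self, hist, learn):
        """Return the index of the nearest cluster, creating one if learning."""
        best, best_d = None, float("inf")
        for idx, center in enumerate(self.centers):
            d = manhattan(hist, center)
            if d < best_d:
                best, best_d = idx, d
        if best is not None and best_d <= self.threshold:
            if learn:  # move the centroid toward the new histogram
                n = self.sizes[best]
                self.centers[best] = [
                    (c * n + h) / (n + 1) for c, h in zip(self.centers[best], hist)
                ]
                self.sizes[best] = n + 1
            return best
        if learn:
            self.centers.append(list(hist))
            self.sizes.append(1)
            return len(self.centers) - 1
        return None  # no known cluster is close enough

    def learn(self, seq):
        """Stream one sequence through the model, updating clusters and edges."""
        prev = None
        for seg in segments(seq, self.segment_size):
            cur = self._assign(pmer_histogram(seg, self.p), learn=True)
            if prev is not None:
                self.edges[(prev, cur)] += 1
            prev = cur

    def score(self, seq):
        """Fraction of consecutive segment pairs whose transition is in the graph."""
        path = [
            self._assign(pmer_histogram(seg, self.p), learn=False)
            for seg in segments(seq, self.segment_size)
        ]
        pairs = list(zip(path, path[1:]))
        if not pairs:
            return 0.0
        return sum(1 for pair in pairs if pair in self.edges) / len(pairs)


profile = QuasiAlignmentProfile(p=2, segment_size=20, threshold=0.5)
profile.learn("A" * 20 + "C" * 20 + "G" * 20)
print(profile.score("A" * 20 + "C" * 20 + "G" * 20))  # same order of segments
print(profile.score("G" * 20 + "C" * 20 + "A" * 20))  # same composition, reversed order
```

In this toy example the two scored sequences have identical overall p-mer composition, yet only the first follows the learned cluster ordering. That is the kind of position information a purely alignment-free frequency count would discard. Calling `learn` on every available 16S copy of an organism would fold all copies into one profile instead of relying on a single representative sequence.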
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by MARGARET Holder DUNHAM
Position Sensitive P-Mer Frequency Clustering with Applications to Classification
- Grant number: 8192895
- Fiscal year: 2011
- Funding amount: $205,000
- Project category:
Similar overseas grants
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
- Grant number: 2337776
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
- Grant number: 2338816
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
- Grant number: 2338846
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
- Grant number: 2348261
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
- Grant number: 2348346
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
- Grant number: 2348457
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
- Grant number: 2404989
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
- Grant number: 2339310
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
- Grant number: 2339669
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Continuing Grant
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
- Grant number: EP/Y029089/1
- Fiscal year: 2024
- Funding amount: $205,000
- Project category: Research Grant


















