ITR: Machine learning approaches to protein sequence comparison: discriminative, semi-supervised, scalable algorithms

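Basic Information

  • Award number:
    0312706
  • Principal investigator:
  • Amount:
    $300,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2003
  • Funding country:
    United States
  • Period:
    2003-09-15 to 2007-08-31
  • Project status:
    Completed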

Project Abstract

PI: Christina Leslie
Co-PI: William Stafford Noble
Collaborator: Jason Weston
Collaborator: Andre Elisseeff

Research Goals. Pairwise sequence comparison is the "killer app" of bioinformatics. In this task, the user queries a protein database with a single sequence, and the algorithm returns a ranked list of sequences that are likely to be evolutionarily related to the query. Two sequences that are descended from a common ancestral sequence, even when their sequence similarity is subtle, are likely to have similar three-dimensional structures and fill similar functional roles in the cell. Hence, recognizing subtle sequence similarities is useful for inferring protein evolution, function and structure.

Almost all existing algorithms for pairwise sequence comparison fall into one of two categories: heuristic alignment algorithms, which scale to large databases but can fail to capture subtle protein similarities; and approaches based on protein family models, which are accurate for determining whether a sequence fits a particular family model but cannot evaluate similarity between two unannotated proteins. The popular PSI-BLAST algorithm is a hybrid of the two approaches: it iteratively builds a model from a single query sequence on the fly and then searches the database for sequences that fit the model. While efficient, PSI-BLAST is known not to be the most accurate method for detecting more remote protein relationships.

The approach that we pursue in this proposal is fundamentally new: we use machine learning algorithms to train offline on examples from the full space of proteins, both those with family annotations and unannotated sequences, so that at run time our trained model can accurately predict which database sequences are related to the query. In other words, we want to introduce learning into the general sequence comparison problem without resorting to a more limited family-based model approach. One primary goal of this research is the development of algorithms that exploit the additional or hidden structure of the problem. To this end, we experiment with a number of learning algorithms, including constrained clustering, neighborhood averaging, use of hierarchical labels and ensembles of classifiers, techniques for dimensionality reduction such as non-negative matrix factorization, and kernel-based semi-supervised approaches.

In addition to algorithm development, we plan to produce a software implementation and web interface that will make our techniques available to the biological community. Throughout our research, we will emphasize techniques that are scalable. We want the actual prediction time to be fast, so that a user can enter a new query sequence and retrieve a ranked list of related sequences from the database in real time via a web interface. Thus we focus on two features: training offline, which allows us to take advantage of more expensive computation in the training process so that the predictions can be fast; and use of fast string kernels, a technique from our work on protein classification that will enable run-time speed-up.

Broader Impacts. Pairwise sequence comparison is a central problem in bioinformatics and genomics, and our techniques for improving the performance and scalability of protein sequence comparison through state-of-the-art machine learning will be broadly useful to biologists and bioinformaticians. The software implementation and web interface that we will produce as part of our proposal will make our techniques available to the biological community. All specifications, datasets, and results from our research will be made publicly available via our web site. All new algorithms will also be described in publications for dissemination to the machine learning community. Finally, we note that the learning challenges of our sequence comparison problem (for example, learning in a setting with a large amount of unlabelled data and only a small amount of labelled data) occur in many other applied areas of machine learning, such as text classification and information retrieval. Thus our research will have impact in many applied learning and data-driven fields.
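The abstract mentions fast string kernels and a ranked list of database sequences but does not say which kernel is used. As a rough illustration only, the sketch below assumes a k-mer spectrum kernel (similarity computed from shared substrings of length k) and ranks a toy database against a query. The function names, the choice of k = 3, and the sequences are all hypothetical and are not taken from the project.

```python
from collections import Counter
from math import sqrt


def spectrum(seq: str, k: int = 3) -> Counter:
    """Count every contiguous k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> float:
    """Normalized spectrum kernel: cosine of the two k-mer count vectors."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    # Counter returns 0 for k-mers that are absent from sb.
    dot = sum(count * sb[kmer] for kmer, count in sa.items())
    norm = sqrt(sum(c * c for c in sa.values())) * sqrt(sum(c * c for c in sb.values()))
    return dot / norm if norm else 0.0


def rank_database(query: str, database: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, spectrum_kernel(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy sequences invented for illustration; they are not real proteins.
toy_db = {
    "seq_a": "MKVLAAGIVGLLLA",
    "seq_b": "MKVLAAGIVALLLA",
    "seq_c": "GGSSPPTTWWYYHH",
}
ranking = rank_database("MKVLAAGIVGLLLA", toy_db)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

This covers only the scoring and ranking step. The offline training and semi-supervised components described in the abstract are outside the scope of the sketch.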

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Christina Leslie

Group IVA phospholipase A2 is necessary for growth cone repulsion and collapse
  • DOI:
  • Publication date:
    2012
  • Journal:
  • Impact factor:
    4.7
  • Authors:
    S. Sanford; Bo Goen Yun; Christina Leslie; R. C. Murphy; K. Pfenninger
  • Corresponding author:
    K. Pfenninger
Latent Class Model
  • DOI:
    10.1007/978-0-387-30164-8_442
  • Publication date:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Geoffrey I. Webb; Claude Sammut; Claudia Perlich; T. Horváth; Stefan Wrobel; K. Korb; W. S. Noble; Christina Leslie; M. Lagoudakis; Novi Quadrianto; W. Buntine; L. Getoor; Galileo Namata; Xin Jin, Jiawei Han; Jo; S. Vijayakumar; Stefan Schaal; L. D. Raedt
  • Corresponding author:
    L. D. Raedt
Randomised double-blind, placebo-controlled trial of coenzyme Q10 therapy in class II and III systolic heart failure
  • DOI:
    10.1046/j.1443-9506.2003.00189.x
  • Publication date:
    2003-01-01
  • Journal:
  • Impact factor:
  • Authors:
    Anne Keogh; Steve Fenton; Christina Leslie; Christina Aboyoun; Peter Macdonald; Michael Yi Chen Zhao; Franklin Bailey; Rosenfeldt
  • Corresponding author:
    Rosenfeldt
Learning By Imitation
  • DOI:
    10.1007/978-0-387-30164-8_448
  • Publication date:
    2010
  • Journal:
  • Impact factor:
    1.7
  • Authors:
    Geoffrey I. Webb; Claude Sammut; Claudia Perlich; T. Horváth; Stefan Wrobel; K. Korb; W. S. Noble; Christina Leslie; M. Lagoudakis; Novi Quadrianto; W. Buntine; L. Getoor; Galileo Namata; Xin Jin, Jiawei Han; Jo; S. Vijayakumar; Stefan Schaal; L. D. Raedt
  • Corresponding author:
    L. D. Raedt
2007 – TRANSCRIPTIONAL CONTROL OF CBX5 BY THE RNA BINDING PROTEINS RBMX AND RBMXL1 MAINTAINS CHROMATIN STATE IN MYELOID LEUKEMIA
  • DOI:
    10.1016/j.exphem.2021.12.372
  • Publication date:
    2021-08-01
  • Journal:
  • Impact factor:
  • Authors:
    Diu Nguyen; Camila Prieto; Zhaoqi Liu; Justin Wheat; Alexander Perez; Saroj Gourkanti; Timothy Chou; Ersilia Barin; Anthony Velleca; Thomas Rohwetter; Arthur Chow; James Taggart; Angela Savino; Katerina Hoskova; Meera Dhodapkar; Alexandra Schurer; Trevor Barlowe; Christina Leslie; Ly Vu; Ulrich Steidl
  • Corresponding author:
    Ulrich Steidl

Other grants by Christina Leslie

III-CXT: Learning from graph-structured data: new algorithms for modeling physical interactions in cellular networks
  • Award number:
    0835494
  • Fiscal year:
    2007
  • Award amount:
    $300,000
  • Project type:
    Continuing Grant
III-CXT: Learning from graph-structured data: new algorithms for modeling physical interactions in cellular networks
  • Award number:
    0705580
  • Fiscal year:
    2007
  • Award amount:
    $300,000
  • Project type:
    Continuing Grant
FASEB Summer Conference on Phospholipases to be held July 9-13, 2000 in Snowmass Village, Colorado
  • Award number:
    0075879
  • Fiscal year:
    2000
  • Award amount:
    $300,000
  • Project type:
    Standard Grant

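Similar National Natural Science Foundation of China (NSFC) grants

Understanding structural evolution of galaxies with machine learning
  • Award number:
    n/a
  • Approval year:
    2022
  • Award amount:
    CNY 100,000
  • Project type:
    Provincial/municipal-level project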

Similar overseas grants

TRUST2 - Improving TRUST in artificial intelligence and machine learning for critical building management
  • Award number:
    10093095
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Collaborative R&D
Quantum Machine Learning for Financial Data Streams
  • Award number:
    10073285
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Feasibility Studies
Explainable machine learning for electrification of everything
  • Award number:
    LP230100439
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Linkage Projects
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
  • Award number:
    EP/Y029089/1
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Research Grant
Machine Learning for Computational Water Treatment
  • Award number:
    EP/X033244/1
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Research Grant
Postdoctoral Fellowship: OPP-PRF: Leveraging Community Structure Data and Machine Learning Techniques to Improve Microbial Functional Diversity in an Arctic Ocean Ecosystem Model
  • Award number:
    2317681
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
RII Track-4:NSF: Physics-Informed Machine Learning with Organ-on-a-Chip Data for an In-Depth Understanding of Disease Progression and Drug Delivery Dynamics
  • Award number:
    2327473
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
  • Award number:
    2337776
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Continuing Grant
CC* Campus Compute: UTEP Cyberinfrastructure for Scientific and Machine Learning Applications
  • Award number:
    2346717
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Learning to create Intelligent Solutions with Machine Learning and Computer Vision: A Pathway to AI Careers for Diverse High School Students
  • Award number:
    2342574
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant