ITR: Machine learning approaches to protein sequence comparison: discriminative, semi-supervised, scalable algorithms

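Basic information

Award number:
0312706
Principal investigator:
Christina Leslie
Amount:
$300,000
Host institution:
Columbia University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2003
Funding country:
United States
Start and end dates:
2003-09-15 to 2007-08-31
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0312706&HistoricalAwards=false
Keywords:
ITR Machine learning approaches protein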

Project abstract

PI: Christina Leslie
Co-PI: William Stafford Noble
Collaborator: Jason Weston
Collaborator: Andre Elisseeff

ITR: Machine learning approaches to protein sequence comparison: discriminative, semi-supervised, scalable algorithms

Research Goals. Pairwise sequence comparison is the "killer app" of bioinformatics. In this task, the user queries a protein database with a single sequence, and the algorithm returns a ranked list of sequences that are likely to be evolutionarily related to the query. Two sequences that are descended from a common ancestral sequence, even though their sequence similarity may be subtle, are likely to have similar three-dimensional structures and fill similar functional roles in the cell. Hence, recognizing subtle sequence similarities is useful for inferring protein evolution, function and structure.

Almost all existing algorithms for pairwise sequence comparison fall into one of two categories: heuristic alignment algorithms that are scalable to large databases but can fail to capture subtle protein similarities; and approaches based on protein family models, which are accurate for determining whether a sequence fits a particular family model but cannot evaluate similarity between two unannotated proteins. The popular PSI-BLAST algorithm is a hybrid of the two approaches: it tries to iteratively build a model from a single query sequence on the fly and then searches the database for sequences that fit the model. While efficient, PSI-BLAST is known not to be the most accurate method for detecting more remote protein relationships.

The approach that we pursue in this proposal is fundamentally new: we use machine learning algorithms to train offline on examples from the full space of proteins, both those with family annotations and unannotated sequences, so that at run-time, our trained model can accurately predict which database sequences are related to the query. In other words, we want to introduce learning into the general sequence comparison problem, without resorting to a more limited family-based model approach. One primary goal of this research is the development of algorithms that exploit the additional or hidden structure of the problem. To this end, we experiment with a number of learning algorithms, including constrained clustering, neighborhood averaging, use of hierarchical labels and ensembles of classifiers, techniques for dimensionality reduction like non-negative matrix factorization, and kernel-based semi-supervised approaches.

In addition to algorithm development, we plan to produce a software implementation and web interface that will make our techniques available to the biological community. Throughout our research, we will emphasize techniques that are scalable. We want the actual prediction time to be fast, so that a user can enter a new query sequence and retrieve a ranked list of related sequences from the database in real time via a web interface. Thus we focus on two features: training offline, which allows us to take advantage of more expensive computation in the training process so that the predictions can be fast; and use of fast string kernels, a technique from our work on protein classification that will enable run-time speed-up.

Broader Impacts. Pairwise sequence comparison is a central problem in bioinformatics and genomics, and our techniques for improving performance and scalability of protein sequence comparison through state-of-the-art machine learning techniques will be broadly useful to biologists and bioinformaticians. The software implementation and web interface that we will produce as part of our proposal will make our techniques available to the biological community. All specifications, datasets, and results from our research will be made publicly available via our web site. All new algorithms will also be described in publications for dissemination to the machine learning community. Finally, we note that the learning challenges of our sequence comparison problem (for example, learning in a setting with a large amount of unlabelled data and only a small amount of labelled data) occur in many other applied areas of machine learning, such as text classification and information retrieval. Thus our research will have impact in many applied learning and data-driven fields.
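
The abstract mentions fast string kernels and a ranked list of database sequences but gives no implementation details. The following is only an illustrative sketch, assuming a simple k-mer spectrum kernel as a stand-in; the abstract does not say which kernel the project used. The function names, the choice of k = 3, and the toy sequences are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    if len(ca) > len(cb):
        ca, cb = cb, ca  # iterate over the smaller table
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalized_kernel(a: str, b: str, k: int = 3) -> float:
    """Cosine-style normalization so that a sequence scores 1.0 against itself."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


def rank_database(query: str, database: dict, k: int = 3) -> list:
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scores = [(name, normalized_kernel(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy_db = {"near": "MKVLAAGIV", "far": "GGGGGGGGG"}
    for name, score in rank_database("MKVLAAGLV", toy_db):
        print(name, round(score, 3))
```

A kernel like this depends only on substring counts, so it can be evaluated quickly at query time; the learned, semi-supervised components described in the abstract are not represented in this sketch.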

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Christina Leslie

Group IVA phospholipase A2 is necessary for growth cone repulsion and collapse

DOI:
Published:
2012
Journal:
Journal of Neurochemistry
Impact factor:
4.7
Authors:
S. Sanford;Bo Goen Yun;Christina Leslie;R. C. Murphy;K. Pfenninger
Corresponding author:
K. Pfenninger

Latent Class Model

DOI:
10.1007/978-0-387-30164-8_442
Published:
2010
Journal:
Encyclopedia of Machine Learning
Impact factor:
0
Authors:
Geoffrey I. Webb;Claude Sammut;Claudia Perlich;T. Horváth;Stefan Wrobel;K. Korb;W. S. Noble;Christina Leslie;M. Lagoudakis;Novi Quadrianto;W. Buntine;L. Getoor;Galileo Namata;Xin Jin, Jiawei Han;Jo;S. Vijayakumar;Stefan Schaal;L. D. Raedt
Corresponding author:
L. D. Raedt

Randomised double-blind, placebo-controlled trial of coenzyme Q<sub>10</sub> therapy in class II and III systolic heart failure

DOI:
10.1046/j.1443-9506.2003.00189.x
Published:
2003-01-01
Journal:
Research article
Impact factor:
Authors:
Anne Keogh;Steve Fenton;Christina Leslie;Christina Aboyoun;Peter Macdonald;Michael Yi Chen Zhao;Franklin Bailey; Rosenfeldt
Corresponding author:
Rosenfeldt

Learning By Imitation

DOI:
10.1007/978-0-387-30164-8_448
Published:
2010
Journal:
Journal of Economics
Impact factor:
1.7
Authors:
Geoffrey I. Webb;Claude Sammut;Claudia Perlich;T. Horváth;Stefan Wrobel;K. Korb;W. S. Noble;Christina Leslie;M. Lagoudakis;Novi Quadrianto;W. Buntine;L. Getoor;Galileo Namata;Xin Jin, Jiawei Han;Jo;S. Vijayakumar;Stefan Schaal;L. D. Raedt
Corresponding author:
L. D. Raedt

2007 – TRANSCRIPTIONAL CONTROL OF CBX5 BY THE RNA BINDING PROTEINS RBMX AND RBMXL1 MAINTAINS CHROMATIN STATE IN MYELOID LEUKEMIA

DOI:
10.1016/j.exphem.2021.12.372
Published:
2021-08-01
Journal:
Conference abstract
Impact factor:
Authors:
Diu Nguyen;Camila Prieto;Zhaoqi Liu;Justin Wheat;Alexander Perez;Saroj Gourkanti;Timothy Chou;Ersilia Barin;Anthony Velleca;Thomas Rohwetter;Arthur Chow;James Taggart;Angela Savino;Katerina Hoskova;Meera Dhodapkar;Alexandra Schurer;Trevor Barlowe;Christina Leslie;Ly Vu;Ulrich Steidl
Corresponding author:
Ulrich Steidl