
COMPUTATIONAL LEARNING & DISCOVERY FOR BIOLOGICAL SEQUENCE, STRUCTURE, FUNCTION

Computational Learning

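Basic Information

Grant number:
7369285
Principal investigator:
RAJ REDDY
Amount:
$1,200
Host institution:
BOSTON UNIVERSITY MEDICAL CAMPUS
Host institution country:
United States
Project category:
Fiscal year:
2006
Funding country:
United States
Project period:
2006-07-01 to 2007-06-30
Project status:
Completed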

Source:
https://reporter.nih.gov/project-details/7369285
Keywords:
COMPUTATIONAL LEARNING DISCOVERY BIOLOGICAL SEQUENCE

Project Abstract

This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Seven focus areas in the realm of protein structure have been identified for application of the language analogy approach: protein folding, conformational changes, protein-protein interactions, protein/gene networks and pathways, secondary structure and repetitive fold prediction and segmentation, protein family classification, and genome comparison. The ultimate goal is to develop linguistic models for each area that are capable of advancing the understanding of that area.

The protocol consists of several steps. The first step is to use existing "benchmark" datasets or to define datasets suitable for training and testing these models. As controls, existing approaches in the focus areas, if available, are studied, and a scheme is designed for evaluating the language model approaches and comparing them with other existing approaches. The next step is to implement our language approach. This implementation initially needs to meet one or both of two requirements: (i) the system has to perform as well as or better than existing systems as defined in step 2, and/or (ii) it needs to provide interpretable biological hypotheses. For example, a neural network might be the best-performing algorithm in a classification task, but the underlying features responsible for this performance can be unclear. A language-based approach that might have lower performance but allows the researcher to analyze the types of features that result in successful classification can be used to build hypotheses on the fundamental building blocks of protein sequence language. The final step in the protocol is to design and carry out experiments that specifically test these hypotheses.

The following systems have been chosen as experimental test cases for the language models: G protein-coupled receptors (GPCRs) such as rhodopsin, metabotropic glutamate receptors, epidermal growth factor receptor, viral tailspike protein, virus infection process, and peptide n-grams. For each of the seven focus areas, we are working to identify or develop benchmark datasets for training and testing of linguistic models. Students and postdoctoral fellows participate in all aspects of the projects.
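
To make the idea of an interpretable, language-style feature concrete, the sketch below treats protein sequences as text, counts overlapping peptide n-grams, and ranks them by a smoothed log-odds score between two sequence sets. It is an illustration only and not the project's actual method: the function names, the pseudocount smoothing, and the toy sequences are all assumptions introduced here.

```python
from collections import Counter
import math


def ngrams(sequence, n):
    """Return all overlapping n-grams (length-n substrings) of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_counts(sequences, n):
    """Count n-grams across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngrams(seq, n))
    return counts


def log_odds(pos_seqs, neg_seqs, n, pseudocount=1.0):
    """Smoothed log-odds of each n-gram in pos_seqs versus neg_seqs.

    Positive scores mark n-grams enriched in pos_seqs, negative scores
    mark n-grams enriched in neg_seqs. The additive pseudocount is an
    assumption chosen for simplicity, not a recommendation.
    """
    pos = ngram_counts(pos_seqs, n)
    neg = ngram_counts(neg_seqs, n)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        g: math.log((pos[g] + pseudocount) / pos_total)
        - math.log((neg[g] + pseudocount) / neg_total)
        for g in vocab
    }


# Hypothetical toy sequences, invented for illustration (not real proteins).
family_a = ["MKVLAAGI", "MKVLSAGI"]
family_b = ["MPTLAAGE", "MPTLSAGE"]

scores = log_odds(family_a, family_b, n=2)

# List the n-grams that most strongly distinguish family_a from family_b.
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(gram, round(score, 3))
```

Unlike the hidden weights of a neural network, each score here maps directly to a short, readable sequence fragment, which is the kind of feature a researcher could turn into a hypothesis and then test experimentally.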