
FINDING PROTEIN SEQUENCE MOTIFS--METHODS AND APPLICATIONS


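Basic information

Grant number:
2578634
Principal investigator:
E V KOONIN
Amount:
--
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Not concluded

Source:
https://reporter.nih.gov/project-details/2578634
Keywords: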
DNA repair; biochemical evolution; brca gene; chemical information system; computer assisted sequence analysis; computer program/software; computer system design/evaluation; developmental genetics; genetic disorder; guanine nucleotide binding protein; guanine nucleotide exchange factors; methyltransferase; protein sequence; protein structure function; protooncogene; statistics/biometry

Project abstract

Sequence information is accumulating far faster than experimental data on protein function, so sensitive methods for protein sequence analysis, including the detection of subtle but functionally important motifs, are becoming steadily more important. The goals of this project include developing a coherent strategy for delineating protein superfamilies and predicting protein function, with the eventual aim of constructing a comprehensive database of protein functional motifs.

The methods used included: database searches with individual sequences (programs of the BLAST and FASTA families) and with multiple sequence alignments (the HMMer package, which builds hidden Markov models from multiple alignments and applies them to database screening); methods for detecting motifs in protein sequences, including those developed at an earlier stage of this project (programs PAST, CAP, MoST, GIBBS); multiple sequence alignment methods (programs MACAW, CLUSTALW); methods for partitioning protein sequences into predicted globular and non-globular domains (program SEG with varying parameters); methods for predicting protein secondary structure (programs PHD, COILS), transmembrane domains (PHDhtm), and signal peptides (SignalP); a method for predicting coding regions in DNA based on non-homogeneous Markov models (GeneMark); and methods for clustering proteins by sequence similarity (CLUS).

These methods were combined into a sequence analysis strategy designed primarily to analyze efficiently the sequences of large, multidomain proteins, which make up the majority of the products of genes implicated in human diseases. The protein sequences were first partitioned into putative globular and non-globular domains; database searches were then conducted separately with the sequence of each globular domain, using a combination of transitive BLAST searches and motif analysis.
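The partitioning step can be illustrated with a minimal sketch. It is not the SEG program and does not reproduce its algorithm; it is a simplified stand-in that flags residues lying in any sliding window of low Shannon entropy. The function names `window_entropy` and `partition_by_complexity`, the window length, and the entropy threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import groupby


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def partition_by_complexity(seq: str, window: int = 12, threshold: float = 2.2):
    """Split seq into ('low' | 'high', start, end) segments.

    A residue is labeled 'low' if it lies in at least one window whose
    entropy is at or below `threshold` (illustrative values, not tuned).
    Coordinates are 0-based and end-exclusive.
    """
    low = [False] * len(seq)
    for i in range(max(len(seq) - window + 1, 0)):
        if window_entropy(seq[i:i + window]) <= threshold:
            for j in range(i, i + window):
                low[j] = True

    # Collapse the per-residue flags into contiguous labeled segments.
    segments, pos = [], 0
    for is_low, run in groupby(low):
        length = len(list(run))
        segments.append(("low" if is_low else "high", pos, pos + length))
        pos += length
    return segments


if __name__ == "__main__":
    # Toy sequence: two compositionally diverse blocks around a poly-Q run.
    block = "ACDEFGHIKLMNPQRSTVWY"
    for label, start, end in partition_by_complexity(block + "Q" * 20 + block):
        print(label, start, end)
```

In a workflow like the one described, each 'high' segment would be the candidate globular region submitted to database searches on its own, so that the compositionally biased stretch does not dominate the hit list.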
In addition to general-purpose sequence databases, separate, smaller databases were constructed using information on protein function and/or phylogenetic origin. Two large data sets, namely the products of genes involved in animal development and the products of positionally cloned human disease genes, were analyzed using these approaches. A variety of previously uncharacterized but potentially functionally important domains and motifs were discovered. Two important examples are a putative FAD-binding domain in the human choroideremia protein, whose modified dinucleotide-binding consensus had prevented its earlier detection, and a domain designated BRCT, which is conserved in a number of proteins involved in DNA damage-responsive cell cycle checkpoints, including the product of the human BRCA1 gene, which is implicated in hereditary breast and ovarian cancers.
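The remark about a modified consensus escaping detection can be made concrete with a small sketch. It assumes the classic G-x-G-x-x-G glycine-rich fingerprint as the strict dinucleotide-binding pattern. The relaxed pattern is hypothetical; the abstract does not state the actual modification found in the choroideremia protein, and the toy sequences below are invented, not real protein sequences.

```python
import re

# Classic glycine-rich fingerprint G-x-G-x-x-G (assumed for illustration).
STRICT = "G.G..G"
# Hypothetical relaxed variant: A or S tolerated at the first and last
# glycine positions. Not the consensus reported by the project.
RELAXED = "[GAS].G..[GAS]"


def find_motif(seq: str, pattern: str):
    """Return (start, matched_text) for every hit, overlapping hits included."""
    # A zero-width lookahead lets the scan restart at every position.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]


if __name__ == "__main__":
    canonical = "MKIAVIGAGIAGLSAA"  # invented toy sequence with a canonical site
    modified = "MKIAVIAAGIASLSAA"   # same toy sequence with two glycines replaced
    for name, seq in (("canonical", canonical), ("modified", modified)):
        print(name, "strict:", find_motif(seq, STRICT))
        print(name, "relaxed:", find_motif(seq, RELAXED))
```

The strict pattern misses the modified toy sequence while the relaxed one recovers it. The cost of relaxing a pattern is more chance matches, which is one reason to prefer statistically scored motif models over fixed patterns.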
