Statistics of Sequence Comparison

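Basic Information

Grant number:
8558094
Principal investigator:
STEPHEN F ALTSCHUL
Amount:
$260,300
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Project period:
to
Project status:
Not concluded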

Project Abstract

The primary focus this year was on the assessment of substitution scoring systems for aligning protein profiles to one another. Pairwise protein sequence alignments are generally evaluated using scores defined as the sum of "substitution scores" for aligning amino acids to one another, and "gap scores" for aligning runs of amino acids in one sequence to null characters inserted into the other. Protein "profiles" may be abstracted from multiple alignments of protein sequences, and substitution and gap scores have been generalized to the alignment of such profiles either to single sequences or to other profiles. Although there is widespread agreement on the general form substitution scores should take for profile-sequence alignment, little consensus has been reached on how best to construct profile-profile substitution scores, and a large number of these scoring systems have been proposed. We assessed a variety of such substitution scores, using several sets of "gold standard" multiple alignments. For our evaluation, we calculated the probability that a profile column yields a higher substitution score when aligned to a related than to an unrelated column. We also considered the same measure applied to sets of two or three adjacent columns. This simple approach had the advantages that it did not depend primarily upon the gold standard alignment columns with the weakest empirical support, and that it did not need to fit gap and offset costs for use with each substitution cost studied. No substitution scoring system emerges as superior in all our tests, but two show consistently strong behavior: a generalization of profile-sequence scores similar to those used in the Compass alignment program, and the recently proposed Bayesian Integral Log-odds (BILD) scores. A secondary focus was on the issues related to the Dirichlet mixture model, used to analyze protein sequences. 
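The evaluation measure described above, the probability that a profile column scores higher against a related column than against an unrelated one, can be estimated by comparing every related score with every unrelated score. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name, the input format (two plain lists of precomputed scores), and the convention of counting ties as one half are all illustrative choices that the abstract does not specify.

```python
from itertools import product


def prob_related_scores_higher(related_scores, unrelated_scores):
    """Estimate P(related score > unrelated score) over all score pairs.

    Ties are counted as 1/2 (an assumption; the abstract does not say
    how ties are handled).
    """
    if not related_scores or not unrelated_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for r, u in product(related_scores, unrelated_scores):
        if r > u:
            wins += 1.0
        elif r == u:
            wins += 0.5
    return wins / (len(related_scores) * len(unrelated_scores))


# Hypothetical scores for one profile column aligned to related and
# unrelated columns; the values are made up for illustration.
related = [2.0, 1.0, 3.0]
unrelated = [0.5, 1.0, -1.0]
print(prob_related_scores_higher(related, unrelated))
```

Because the measure only compares substitution scores column by column, it needs no gap or offset costs, which matches the advantage noted in the abstract.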
The Dirichlet mixture model was introduced to protein sequence analysis by Haussler's group at UCSC. In brief, this model imagines that a particular position in a protein family is described by a multinomial distribution on the set of amino acids. Although the multinomial for a particular position may be unique, the study of many protein families reveals that certain regions of multinomial space are much more heavily populated than others. This general knowledge may be summarized by a "Dirichlet mixture prior", which is a probability density over multinomial space that lends itself to easy analysis. Our research on Dirichlet mixture priors this year centered on the question of how best to derive such priors from a set of multiple alignment data. Our previous work had applied the Minimum Description Length principle and a Gibbs sampling algorithm to this problem. Work begun this year applied the Dirichlet Process to this problem; preliminary results suggest that this leads to much improved mixtures with many more components.
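One common use of a Dirichlet mixture prior is to turn the observed amino acid counts in an alignment column into an estimate of that column's multinomial. The sketch below computes the standard posterior-mean estimate: each component is weighted by how well it explains the counts, and the per-component pseudocount estimates are averaged with those weights. It is an illustration under stated assumptions, not the authors' software. The function names are hypothetical, and the usage example uses a toy 3-letter alphabet with made-up component parameters rather than a real 20-letter amino acid prior.

```python
import math


def log_multivariate_beta(alpha):
    """Log of the multivariate beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def dirichlet_mixture_posterior_mean(counts, weights, alphas):
    """Posterior-mean letter probabilities under a Dirichlet mixture prior.

    counts  -- observed letter counts for one column
    weights -- prior mixture weights, one per component (summing to 1)
    alphas  -- Dirichlet parameter vectors, one per component
    Returns (probabilities, posterior component weights).
    """
    # Posterior weight of component j is proportional to
    # weights[j] * B(alphas[j] + counts) / B(alphas[j]).
    log_w = []
    for q, alpha in zip(weights, alphas):
        updated = [a + n for a, n in zip(alpha, counts)]
        log_w.append(
            math.log(q)
            + log_multivariate_beta(updated)
            - log_multivariate_beta(alpha)
        )
    top = max(log_w)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(x - top) for x in log_w]
    norm = sum(unnormalized)
    post_weights = [u / norm for u in unnormalized]

    # Average the per-component pseudocount estimates.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for w, alpha in zip(post_weights, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += w * (n + a) / denom
    return probs, post_weights


# Toy example: two hypothetical components, each favoring one letter.
probs, post_weights = dirichlet_mixture_posterior_mean(
    counts=[5, 0, 0],
    weights=[0.5, 0.5],
    alphas=[[10.0, 1.0, 1.0], [1.0, 10.0, 1.0]],
)
print(probs, post_weights)
```

In the toy example the counts favor the first letter, so nearly all posterior weight shifts to the first component. Deriving the components themselves from alignment data, which is the problem addressed in the abstract, is a separate and harder estimation task that this sketch does not attempt.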