Statistics of Sequence Comparison

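Basic Information

Grant number: 10007519
Principal investigator: STEPHEN F ALTSCHUL
Amount: $235,300
Host institution: NATIONAL LIBRARY OF MEDICINE
Host institution country: United States
Project category:
Fiscal year:
Funding country: United States
Project period:
Project status: Not concluded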

Project Abstract

The current direction of this project, pursued in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and the Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year. Previous work focused on developing an improved multiple alignment method that could identify the common elements shared by large and diverse protein superfamilies, and on extending this method to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We applied the Minimum Description Length principle to ensure that the hierarchical models we construct do not overfit the data but have statistical support.
This year the central focus of this project was the statistical assessment of the three-dimensional clustering of "distinguished positions", identified as characteristic of various nodes in a hierarchy. Our approach, called Initial Cluster Analysis (ICA), seeks to determine whether a set of distinguished elements within a linear array is clustered significantly near the start of the array and, if so, what the most significant initial cluster of these elements is.
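The point made above about the hierarchical model, that pooling subfamilies produces apparent correlation between positions that are independent within each subfamily, can be checked with a small worked example. The sketch below is illustrative only: the two subfamilies, the two-letter alphabet, and every frequency value are hypothetical and are not taken from the project.

```python
from itertools import product

# Hypothetical toy data: probability of residue 'A' (versus 'B') at two
# positions, for two equally weighted subfamilies. Within each subfamily
# the two positions are independent by construction.
subfamilies = {
    "sub1": {"weight": 0.5, "p_A": (0.9, 0.9)},
    "sub2": {"weight": 0.5, "p_A": (0.1, 0.1)},
}

def joint(p_A):
    """Joint distribution over two independent binary positions."""
    p1, p2 = p_A
    return {
        (a, b): (p1 if a == "A" else 1 - p1) * (p2 if b == "A" else 1 - p2)
        for a, b in product("AB", repeat=2)
    }

def covariance(dist):
    """Covariance of the indicators [pos1 == 'A'] and [pos2 == 'A']."""
    e1 = sum(p for (a, _), p in dist.items() if a == "A")
    e2 = sum(p for (_, b), p in dist.items() if b == "A")
    return dist[("A", "A")] - e1 * e2

# Pool the subfamilies into one flat, non-hierarchical distribution.
pooled = {}
for fam in subfamilies.values():
    for key, p in joint(fam["p_A"]).items():
        pooled[key] = pooled.get(key, 0.0) + fam["weight"] * p

within = {name: covariance(joint(f["p_A"])) for name, f in subfamilies.items()}
pooled_cov = covariance(pooled)

print("within-subfamily covariances:", within)
print("pooled covariance:", pooled_cov)
```

With these toy numbers the covariance inside each subfamily is zero up to rounding, while the pooled covariance is clearly positive. A flat model would read that as coupling between the two positions, whereas assigning sequences to subfamily nodes accounts for it without any pairwise term.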
Abstractly, given a linear array of length L containing D '1's (the distinguished elements) and L-D '0's, ICA considers a generative model in which the '1's occur with particular and differing probabilities before and after a cut point X in the array. For any particular X it is relatively easy to calculate a likelihood Like(X) of the array of data, and one may optimize Like(X) by simply evaluating it for all possible X. However, the values of Like(X) for close values of X are highly correlated, dependent upon a calculable "density of independent trials" Rho(X). Because Rho(X) is not constant but rather grows approximately as the reciprocal of X's distance from 0 or L, simply optimizing Like(X) inherently favors, a priori, small or large values of X. Therefore, if one's application suggests no such bias, choosing to optimize Like(X)/Rho(X) rather than Like(X) for a given array of '0's and '1's may be a better strategy; we refer to this approach as using "flattened priors". ICA estimates the effective total number of independent trials implicit in either optimization, which it uses in calculating a p-value for the optimal X. This provides a mathematically principled way to define an optimal initial cluster of distinguished elements, balancing the claims of very short and dense clusters with those of longer but sparser clusters. We published ICA in the Journal of Computational Biology.
To analyze real proteins using ICA, we ordered the residues within a protein by their physical distance from a point of reference, and used our previously developed hierarchical analysis to define a set of distinguished residues, characteristic of a protein family or subfamily. ICA then allows us to find sets of distinguished residues that are significantly clustered in three dimensions.
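The exhaustive scan over cut points described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published ICA implementation. It scores each X by a maximum-likelihood log-ratio of a two-rate model against a single-rate model. It restricts attention to cuts whose prefix is denser than the remainder. The function `assumed_rho` is a hypothetical stand-in that only mimics the qualitative behaviour stated in the text (growth as the reciprocal of the distance from either end). The effective number of independent trials and the p-value are not reproduced here.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_loglik(ones, n):
    """Maximised log-likelihood of `ones` 1s among `n` Bernoulli trials."""
    if n == 0:
        return 0.0
    p = ones / n
    return xlogy(ones, p) + xlogy(n - ones, 1 - p)

def assumed_rho(x, length):
    """ASSUMPTION: illustrative stand-in for Rho(X), not the published formula."""
    return 1.0 / x + 1.0 / (length - x)

def scan_cut_points(array, flatten=False):
    """Return (best_X, best_score) over all cut points 1 <= X < L.

    The score is the log-likelihood ratio of the two-rate model against a
    single-rate model. With flatten=True, log(assumed_rho(X)) is subtracted,
    which corresponds to optimising Like(X)/Rho(X) instead of Like(X).
    """
    length, total = len(array), sum(array)
    null = bernoulli_loglik(total, length)
    best_x, best_score = None, -math.inf
    prefix = 0
    for x in range(1, length):
        prefix += array[x - 1]
        suffix = total - prefix
        # Skip cuts whose prefix is not denser than the rest of the array.
        if prefix * (length - x) <= suffix * x:
            continue
        score = (bernoulli_loglik(prefix, x)
                 + bernoulli_loglik(suffix, length - x) - null)
        if flatten:
            score -= math.log(assumed_rho(x, length))
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Toy array: four distinguished elements among the first five positions.
example = [1, 1, 1, 0, 1] + [0] * 15
best_x, best_score = scan_cut_points(example)
flat_x, flat_score = scan_cut_points(example, flatten=True)
print(best_x, round(best_score, 3), flat_x, round(flat_score, 3))
```

For this toy array both variants select X = 5, the cut that closes the dense initial run. On real data the two scores can disagree, which is the situation the flattened-priors option is meant to address.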
Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with the biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. This work was published in eLife.
We initiated work on a new project to summarize and analyze the constraints on protein sequence and structure that may be derived from large multiple sequence alignments. For a particular protein, these constraints include those on amino acid usage at particular positions due to the protein's subfamily function, as well as those characteristic of the family and superfamily of which the protein is a member. Additional constraints, which may be derived from Direct Coupling Analysis (DCA), are due to internal or heterodimeric pairwise interactions between different protein positions. The integrated analysis of these various constraints can suggest new lines of experimentation.
