Statistics of Sequence Comparison
Basic information
- Grant number: 10007519
- Principal investigator:
- Amount: $235,300
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year:
- Funding country: United States
- Project period:
- Project status: Ongoing
- Source:
- Keywords: Acetyltransferase; Amino Acid Sequence; Amino Acids; Appearance; Biochemical; Biochemistry; Biological; Characteristics; Cluster Analysis; Collaborations; Computational Biology; Coupling; DNA Sequence; Data; Development; Dimensions; Elements; Equilibrium; Family Characteristics; Frequencies; Goals; Guanosine Triphosphate Phosphohydrolases; Individual; Institutes; Journals; Length; Maryland; Mathematics; Methods; Modeling; Molecular Biology; Pattern; Phosphoric Monoester Hydrolases; Positioning Attribute; Probability; Protein Family; Proteins; Publishing; RNA Helicase; Role; Sequence Alignment; Structural Protein; Structure; System; Thymine; Universities; Work; base; density; genome sciences; improved; medical schools; member; nuclease; protein structure; rho; statistics; synaptojanin; uracil-DNA glycosylase; vector
Project abstract
The current direction of this project, in collaboration with Dr.
Andrew Neuwald of the Institute for Genome Sciences and Department
of Biochemistry & Molecular Biology at the University of Maryland
School of Medicine, continued throughout this year. Previous
focuses had been the development of an improved method for multiple
alignment that could identify the common elements shared by large
and diverse protein superfamilies, and the extension of this method
to a hierarchical multiple alignment model. Such a model is based
on the fact that large protein superfamilies frequently have
diversified to fulfill distinct functional roles within different
subfamilies. Each subfamily has distinct structural constraints,
which yield distinct amino acid frequency vectors at particular
positions characteristic of that subfamily. Although, within a
subfamily, the amino acids at different positions may be independent,
the changes in frequency vectors across multiple positions
characteristic of each subfamily yield the appearance of
correlation between positions when a simple, non-hierarchical
model of a superfamily is constructed. Earlier approaches have
modeled these apparent correlations directly, using pairwise
coupling terms, but we model them by constructing an explicit
hierarchical model, with individual sequences assigned to distinct
nodes within the hierarchy. We applied the Minimum Description
Length principle to ensure that the hierarchical models we
construct do not overfit the data, but have statistical support.
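The claim that pooling subfamilies produces apparent correlation can be checked with a short derivation. This is a sketch under assumed notation that does not appear in the abstract: w_k is the fraction of superfamily sequences in subfamily k, and f_i^{(k)}(a) is the frequency of amino acid a at position i within subfamily k, with positions i and j independent inside each subfamily.

```latex
% Assumed notation (not from the abstract):
%   w_k          = fraction of sequences belonging to subfamily k
%   f_i^{(k)}(a) = frequency of amino acid a at position i within subfamily k
\begin{align*}
P_{ij}(a,b) &= \sum_k w_k\, f_i^{(k)}(a)\, f_j^{(k)}(b), \qquad
P_i(a) = \sum_k w_k\, f_i^{(k)}(a), \\
P_{ij}(a,b) - P_i(a)\,P_j(b)
  &= \sum_k w_k \bigl(f_i^{(k)}(a) - P_i(a)\bigr)\bigl(f_j^{(k)}(b) - P_j(b)\bigr).
\end{align*}
```

The right-hand side vanishes when the frequency vector at either position is the same in every subfamily, and is generally nonzero when both vary across subfamilies, which is the apparent correlation seen in a flat superfamily model.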
This year the central focus of this project was the statistical
assessment of the three-dimensional clustering of "distinguished
positions", identified as characteristic of various nodes in
a hierarchy. Our approach, called Initial Cluster Analysis (ICA),
seeks to determine whether a set of distinguished elements within
a linear array is clustered significantly near the start of the
array and, if so, what is the most significant initial cluster
of these elements. Abstractly, given a linear array of length L
containing D '1's (the distinguished elements) and L-D '0's,
it considers a generative model in which the '1's occur
with particular and differing probabilities before and after a
cut point X in the array. For any particular X it is relatively
easy to calculate a likelihood Like(X) of the array of data,
and one may optimize Like(X) by simply evaluating it for all
possible X. However, the values of Like(X) for close values
of X are highly correlated, dependent upon a calculable "density
of independent trials" Rho(X). Because Rho(X) is not constant
but rather grows approximately as the reciprocal of X's distance
from 0 or L, simply optimizing Like(X) inherently favors, a priori,
small or large values of X. Therefore, if one's application
suggests no such bias, choosing to optimize Like(X)/Rho(X) rather
than Like(X) for a given array of '0's and '1's may be a better
strategy; we refer to this approach as using "flattened priors".
ICA estimates the effective total number of independent trials
implicit in either optimization, which it uses in calculating
a p-value for the optimal X. This provides a mathematically
principled way to define an optimal initial cluster of
distinguished elements, balancing the claims of very short
and dense clusters with those of longer but sparser clusters.
We published ICA in the Journal of Computational Biology.
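The cut-point search described above can be illustrated with a minimal Python sketch. It is not the published ICA implementation: it assumes Like(X) is the maximum likelihood of the array under two Bernoulli rates (one before and one after X), and it uses 1/X + 1/(L-X) as a hypothetical stand-in for Rho(X), chosen only because it grows as the reciprocal of the distance from either end. The names `best_cut` and `_bernoulli_loglik` are illustrative, and the p-value step is omitted because the abstract does not give its formula.

```python
import math

def _bernoulli_loglik(ones, n):
    """Maximized log-likelihood of `ones` successes in n Bernoulli trials."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0  # rate fitted at 0 or 1 gives likelihood 1
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)

def best_cut(array, flatten=False):
    """Return (X, score) maximizing log Like(X) over cut points 1..L-1.

    With flatten=True the score is log(Like(X) / Rho(X)), where Rho(X) is
    an ASSUMED stand-in, 1/X + 1/(L-X), not the formula from the paper.
    """
    L = len(array)
    D = sum(array)
    best = None
    ones_before = 0
    for X in range(1, L):
        ones_before += array[X - 1]  # count of '1's among the first X elements
        score = (_bernoulli_loglik(ones_before, X)
                 + _bernoulli_loglik(D - ones_before, L - X))
        if flatten:
            rho = 1.0 / X + 1.0 / (L - X)  # hypothetical density of trials
            score -= math.log(rho)
        if best is None or score > best[1]:
            best = (X, score)
    return best

# Example: four distinguished elements packed at the start of a 12-element array.
example = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
cut, _ = best_cut(example)
cut_flat, _ = best_cut(example, flatten=True)
```

Evaluating every X costs O(L) here because the count of '1's before the cut is updated incrementally rather than recomputed.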
To analyze real proteins using ICA, we ordered the residues within
a protein by their physical distance from a point of reference,
and used our previously developed hierarchical analysis to define
a set of distinguished residues, characteristic of a protein family
or subfamily. ICA then allows us to find sets of distinguished
residues that are significantly clustered in three dimensions.
Applying this approach to N-acetyltransferases, P-loop GTPases,
RNA helicases, synaptojanin-superfamily phosphatases and nucleases,
and thymine/uracil DNA glycosylases yielded results congruent with
biochemical understanding of these proteins, and also revealed
striking sequence-structural features overlooked by other methods.
This work was published in eLife.
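The step that turns a three-dimensional structure into a linear array can be sketched as follows. This is an illustration only: the function name `ordered_indicator`, the dictionary layout, and the use of a single coordinate per residue are assumptions, not details taken from the abstract.

```python
import math

def ordered_indicator(coords, distinguished, reference):
    """Order residues by distance from `reference` and mark distinguished ones.

    coords        : dict mapping residue id -> (x, y, z); layout is assumed
    distinguished : set of residue ids flagged by a prior analysis
    reference     : (x, y, z) point of reference
    Returns a list of 0/1 values, nearest residue first.
    """
    ranked = sorted(coords, key=lambda r: math.dist(coords[r], reference))
    return [1 if r in distinguished else 0 for r in ranked]

# Toy example with four residues on a line.
toy = {"A": (0, 0, 0), "B": (5, 0, 0), "C": (1, 0, 0), "D": (9, 0, 0)}
array = ordered_indicator(toy, {"A", "C"}, (0, 0, 0))
```

The resulting 0/1 list has the form the abstract describes for ICA: a linear array in which clustering near the start corresponds to clustering near the reference point in space.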
We initiated work on a new project to summarize and analyze the
constraints on protein sequence and structure that may be derived
from large multiple sequence alignments. For a particular protein,
these constraints include those on amino acid usage in particular
positions due to the protein's subfamily function, as well as
those constraints characteristic of the family and superfamily
of which the protein is a member. Additional constraints, which
may be derived from DCA, are due to internal or heterodimeric
pairwise interactions between different protein positions. The
integrated analysis of these various constraints can suggest new
lines for experimentation.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by STEPHEN F ALTSCHUL
Improvements And Extensions To The Blast Algorithms
- Grant number: 6546809
- Fiscal year:
- Funding amount: $235,300
- Project category:
Improvements And Extensions To The Blast Algorithms
- Grant number: 6843572
- Fiscal year:
- Funding amount: $235,300
- Project category:
IMPROVEMENTS AND EXTENSIONS TO THE BLAST ALGORITHMS
- Grant number: 6432754
- Fiscal year:
- Funding amount: $235,300
- Project category:
Improvements and Extensions to the BLAST Algorithms
- Grant number: 9555732
- Fiscal year:
- Funding amount: $235,300
- Project category:
Similar overseas grants
Cerebral infarction treatment strategy using collagen-like "triple helix peptide" containing functional amino acid sequence
- Grant number: 23K06972
- Fiscal year: 2023
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
Establishment of a screening method for functional microproteins independent of amino acid sequence conservation
- Grant number: 23KJ0939
- Fiscal year: 2023
- Funding amount: $235,300
- Project category: Grant-in-Aid for JSPS Fellows
Effects of amino acid sequence and lipids on the structure and self-association of transmembrane helices
- Grant number: 19K07013
- Fiscal year: 2019
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
Construction of electron-transfer amino acid sequence probe with an interaction for protein and cell
- Grant number: 16K05820
- Fiscal year: 2016
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
Development of artificial antibody of anti-bitter taste receptor using random amino acid sequence library
- Grant number: 16K08426
- Fiscal year: 2016
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
The aa15-17 amino acid sequence in the terminal protein domain of HBV polymerase as a viral factor affecting in vivo as well as in vitro replication activity of the virus.
- Grant number: 25461010
- Fiscal year: 2013
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
Amino acid sequence analysis of fossil proteins using mass spectrometry
- Grant number: 23654177
- Fiscal year: 2011
- Funding amount: $235,300
- Project category: Grant-in-Aid for Challenging Exploratory Research
Precise hybrid synthesis of glycoprotein through amino acid sequence-specific introduction of oligosaccharide followed by enzymatic transglycosylation reaction
- Grant number: 22550105
- Fiscal year: 2010
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)
Estimating selection on amino-acid sequence polymorphisms in Drosophila
- Grant number: NE/D00232X/1
- Fiscal year: 2006
- Funding amount: $235,300
- Project category: Research Grant
Construction of a neural network for detecting novel domains from amino acid sequence information only
- Grant number: 16500189
- Fiscal year: 2004
- Funding amount: $235,300
- Project category: Grant-in-Aid for Scientific Research (C)