Detecting Homology in the "Twilight Zone" of Sequence Similarity

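Basic Information

Grant number:
7799248
Principal investigator:
RANDEN LEE PATTERSON
Amount:
approximately $141,600
Host institution:
PENNSYLVANIA STATE UNIVERSITY, THE
Host institution country:
United States
Project category:
Fiscal year:
2009
Funding country:
United States
Project period:
2009-04-10 to 2011-02-01
Project status:
Completed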

Source:
https://reporter.nih.gov/project-details/7799248
Keywords:
Algorithms Amino Acid Sequence Area Benchmarking Blast Cell Case Study Characteristics Complex Computational Technique Computers Data Data Set Databases Detection Development Disease Evolution Explosion Family Fingerprint Grant Introns Laboratories Length Location Manuals Maps Measurement Measures Methods Modeling Peptide Sequence Determination Performance Phylogenetic Analysis Plant Roots Plasmids Polymerase Process Protein Engineering Proteins RNA Viruses RNA-Directed RNA Polymerase Recording of previous events Regulatory Element Reporting Research Resolution Resources Retroelements Scientist Sequence Alignment Set protein Speed Structural Models Structure Translational Research Trees Viral Virus Work arm base clinically relevant combat domain mapping falls insight knowledge base pharmacophore protein structure public health relevance research study simulation therapy development tool user-friendly viral RNA

Project Abstract

DESCRIPTION (provided by applicant): The "protein problem" has remained unsolved despite decades of research [1, 2]. In principle, one expects that the primary amino acid sequence of a protein determines its structure, function, and evolutionary (SF&E) characteristics. Yet there is still no reliable method for predicting the native-state structure of a protein and its function given only its sequence. In addition, inferring the evolutionary relationships among highly divergent protein sequences is a daunting task. In general, when pairwise sequence alignments between protein sequences fall below 25% identity, statistical measurements do not provide support robust enough to identify clear phylogenetic relationships, despite intensive research in this area [1, 3, 4]. The recent explosion in the availability of knowledge bases and computational techniques for the analysis of complex data has created an unprecedented opportunity for teasing out invaluable information from protein sequences. Starting with the basic premise that protein sequence encodes information about SF&E, we developed a unified framework for inferring SF&E from sequence information using a knowledge-based approach in which we measure the similarity between a query sequence and a set of biologically relevant profiles in an unbiased manner. Results from this Gestalt Domain Detection Algorithm-Basic Local Alignment Tool (GDDA-BLAST) provide phylogenetic profiles that have the capacity to model SF&E relationships of various proteins. Indeed, GDDA-BLAST is capable of deriving deep phylogenetic relationships for highly divergent proteins in a quantifiable manner [5, 6].
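The 25% identity cutoff mentioned above can be illustrated with a minimal sketch. The function names and the convention of counting only ungapped columns are assumptions made for illustration; percent identity has several competing definitions, and nothing here is taken from GDDA-BLAST itself.

```python
TWILIGHT_THRESHOLD = 25.0  # percent identity cutoff quoted in the abstract


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumes both strings come from the same pairwise alignment, with '-'
    marking gaps. Other conventions (e.g. dividing by the alignment
    length or by the shorter sequence) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_twilight_zone(aligned_a: str, aligned_b: str,
                     threshold: float = TWILIGHT_THRESHOLD) -> bool:
    """True when the pair falls below the identity threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

A pair flagged by `in_twilight_zone` is one for which, according to the abstract, conventional alignment statistics may not support clear phylogenetic inference.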
Preliminary results from our computational case study of the highly divergent family of retroelements accord with those previously reported, and demonstrate that GDDA-BLAST measurements can be treated as "fingerprints" that can be used to derive distance estimates and hence phylogenetic relationships without prior information, multiple sequence alignment, or manual editing. We propose that sequence information present within the "twilight zone" of sequence similarity can provide key insight into SF&E relationships among distantly related and/or rapidly evolving proteins. This proposal aims to push our limits of detecting homology within the "twilight zone" of sequence similarity by evaluating and optimizing GDDA-BLAST performance on benchmark and experimental data sets. Armed with these refined GDDA-BLAST measurements, we propose to conduct a comprehensive, ab initio, phylogenetic study of retroelements and RNA-dependent RNA polymerases from the positive-strand family of RNA viruses (+ssRNA). Simultaneously, we will derive high-resolution maps of domain boundaries and empirically validate functional annotations and predictions of key residues for those activities. This work aims to perform translational research from the computer to the laboratory bench top. We expect that the tools and resources generated from this grant will be accessible and user-friendly to the bench scientist, thereby speeding the discovery process of other clinically relevant research endeavors.
PUBLIC HEALTH RELEVANCE: The long-term implication of this proposal is the development of a unified framework for high-resolution and simultaneous measurements of structure, function, and evolution. Should this be possible: (i) functional and evolutionary measurements could quantitatively inform structural modeling to derive accurate atomic-resolution protein structures, (ii) structural and functional measurements could inform evolutionary histories to derive accurate evolutionary rates, deep-branch relationships, and homologous spaces within each protein, and (iii) structural and evolutionary measures would indicate the location of functionalities contained within any protein and the regulatory elements which control these functions. Armed with this information, the speed at which diseases could be understood and pharmacophores/therapies developed to combat them would likely increase dramatically.
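The idea of treating per-sequence measurements as "fingerprints" from which distance estimates are derived can be sketched as follows. This is a hypothetical illustration only: it assumes each fingerprint is a fixed-length vector of scores against the same ordered set of profiles, and it uses one minus the Pearson correlation as the distance. The abstract does not state which distance measure GDDA-BLAST uses, and the names `pearson_distance` and `distance_matrix` are invented here.

```python
import math
from itertools import combinations


def pearson_distance(u, v):
    """Distance defined as 1 - Pearson correlation (an assumed choice).

    Ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).
    """
    n = len(u)
    if n == 0 or n != len(v):
        raise ValueError("fingerprints must be non-empty and equal length")
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # constant vector: treat as uncorrelated
    return 1.0 - cov / (norm_u * norm_v)


def distance_matrix(fingerprints):
    """Build a symmetric distance lookup from {name: score vector}."""
    names = sorted(fingerprints)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = pearson_distance(fingerprints[a], fingerprints[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return names, dist


if __name__ == "__main__":
    # Made-up scores against four hypothetical profiles.
    example = {
        "seqA": [1.0, 2.0, 3.0, 4.0],
        "seqB": [2.0, 4.0, 6.0, 8.0],
        "seqC": [4.0, 3.0, 2.0, 1.0],
    }
    names, dist = distance_matrix(example)
    for a, b in combinations(names, 2):
        print(a, b, round(dist[(a, b)], 3))
```

A matrix of this kind needs no multiple sequence alignment and could, in principle, be handed to a distance-based tree-building method such as neighbor joining.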
