
SUBSTITUTION MATRICES INTO THE NSP-TREE IN BIOLOGICAL SEQUENCE DATABASES


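Basic information

Grant number:
8167540
Principal investigator:
GANG QIAN
Amount:
$29,700
Host institution:
UNIVERSITY OF OKLAHOMA HLTH SCIENCES CTR
Host institution country:
United States
Project category:
Fiscal year:
2010
Funding country:
United States
Project period:
2010-04-01 to 2011-03-31
Project status:
Completed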

Project abstract

This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

A basic operation on biological sequence databases is to locate homologous regions for a given query sequence using pair-wise alignments. Unfortunately, the dynamic programming algorithm used for sequence alignments is computationally expensive, making it prohibitive for today's rapidly growing sequence databases. Existing alignment tools, such as FASTA and BLAST, though fast in locating candidate homologous regions, sacrifice sensitivity for efficiency: they may miss some true homologous regions in database sequences. In this project, we will develop novel indexing algorithms for large biological databases that support efficient pair-wise sequence alignments with high sensitivity. Specifically, we will incorporate widely used substitution matrices, such as PAM and BLOSUM, into the construction algorithms of the NSP-tree (an index structure designed for sequence data) so that sequences with evolutionarily related letters are grouped together in the structure of the NSP-tree. As a result, indexed sequence groups with unrelated letters will obtain a low score when aligned to a given query sequence, and will be promptly pruned. By enhancing the pruning power of the NSP-tree, we expect that the new index-based approach will provide high sensitivity while maintaining a level of efficiency comparable to, or even higher than, that of existing pair-wise alignment tools.
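
The pruning idea in the paragraph above can be illustrated with a minimal sketch. It is not the NSP-tree's actual algorithm: it assumes, purely for illustration, that an indexed group is summarized as a set of allowed letters per position, and it uses a hypothetical four-letter scoring table (`TOY_SCORES`) rather than real PAM or BLOSUM entries. The function names `local_score`, `align_sequence`, and `group_upper_bound` are likewise invented here. The same Smith-Waterman style recurrence is run twice: once against a concrete sequence, and once against the group using the most favorable letter in each column, which gives an upper bound on the score of every member.

```python
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical toy scores (NOT PAM/BLOSUM values). Unlisted pairs score MISMATCH.
TOY_SCORES: Dict[FrozenSet[str], int] = {
    frozenset("A"): 4, frozenset("C"): 4, frozenset("G"): 4, frozenset("T"): 4,
    frozenset("AG"): 1, frozenset("CT"): 1,
}
MISMATCH = -3
GAP = -4  # linear gap penalty (an assumption; real tools often use affine gaps)


def sub_score(a: str, b: str) -> int:
    """Symmetric substitution score for a pair of letters."""
    return TOY_SCORES.get(frozenset((a, b)), MISMATCH)


def local_score(query: str, columns: Sequence, cell_score: Callable) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gaps)."""
    prev = [0] * (len(columns) + 1)
    best = 0
    for q in query:
        cur = [0]
        for j, col in enumerate(columns, start=1):
            h = max(0,
                    prev[j - 1] + cell_score(q, col),  # align q with column j
                    prev[j] + GAP,                     # gap in the target
                    cur[j - 1] + GAP)                  # gap in the query
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


def align_sequence(query: str, target: str) -> int:
    """Score the query against one concrete sequence."""
    return local_score(query, target, sub_score)


def group_upper_bound(query: str, group: Sequence[FrozenSet[str]]) -> int:
    """Optimistic score: every column contributes its best-matching letter."""
    return local_score(
        query, group, lambda q, letters: max(sub_score(q, x) for x in letters)
    )
```

Because the optimistic cell score is never lower than the score of any actual member letter, the bound is never lower than a member's true score. A search could therefore skip a whole group whenever `group_upper_bound(query, group)` falls below the reporting threshold, and only run `align_sequence` on members of groups that survive.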
The project will be conducted in four steps: 1) Developing a new dynamic programming query algorithm to handle the alignments between a query sequence and sequence groups indexed in the tree; 2) Based on the substitution matrices, analyzing functionally conservative letters in biological sequences, and creating a clustering tree that hierarchically organizes the proximity of the letters based on their evolutionary closeness; 3) Designing new heuristics that incorporate the clustering tree of letters into the construction algorithms of the NSP-tree; and 4) Conducting experimental studies on the performance of the new heuristics and comparing the performance of the NSP-tree with that of the existing tools.
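
Step 2 above, building a clustering tree of letters from a substitution matrix, might look roughly like the following sketch. The abstract does not say which clustering method the project uses, so average-linkage agglomerative clustering is an assumption here, as are the hypothetical `TOY_SCORES` table (not real PAM or BLOSUM values) and the names `similarity`, `leaves`, `average_similarity`, and `cluster_letters`.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple, Union

# A tree is either a single letter or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]

# Hypothetical toy scores (NOT PAM/BLOSUM values): A/G and C/T are "related".
TOY_SCORES: Dict[FrozenSet[str], int] = {frozenset("AG"): 1, frozenset("CT"): 1}
UNRELATED = -3


def similarity(a: str, b: str) -> int:
    """Symmetric substitution score between two distinct letters."""
    return TOY_SCORES.get(frozenset((a, b)), UNRELATED)


def leaves(tree: Tree) -> List[str]:
    """All letters contained in a (sub)tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def average_similarity(x: Tree, y: Tree) -> float:
    """Average-linkage similarity between two clusters of letters."""
    pairs = [(a, b) for a in leaves(x) for b in leaves(y)]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def cluster_letters(alphabet: str) -> Tree:
    """Repeatedly merge the two most similar clusters until one tree remains."""
    clusters: List[Tree] = list(alphabet)
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


print(cluster_letters("ACGT"))  # (('A', 'G'), ('C', 'T'))
```

With these toy scores, the related letters A/G and C/T merge first and the two pairs join only at the root. An index-construction heuristic (step 3) could then prefer splits that keep letters from the same subtree in the same group.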
