BULK-LOADING & PERFORMANCE STUDIES OF THE ND-TREE FOR LARGE GENOME DATABASES

Basic Information

Grant number:
7610287
Principal investigator:
GANG QIAN
Amount:
$40,000
Host institution:
UNIVERSITY OF OKLAHOMA HLTH SCIENCES CTR
Host institution country:
United States
Project category:
Fiscal year:
2007
Funding country:
United States
Project period:
2007-05-01 to 2008-04-30
Project status:
Completed

Project Abstract

This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator. The subproject seeks to provide an efficient indexing method to speed up searches of large biological information databases. In particular, the research is based on a multi-dimensional, disk-based index structure called the ND-tree, which is designed to support similarity queries on vectors/q-grams in large non-ordered discrete data sets. The current method for constructing the ND-tree is incremental, which may significantly limit the effective use of the index given the huge amount of data to be indexed. The subproject focuses on finding an efficient algorithm to bulk-load the ND-tree. Unlike the incremental method, the new bulk-loading algorithm assumes that some memory space is available for bulk-loading. It is therefore possible for the algorithm to load thousands of vectors into the index structure without incurring a single disk I/O, resulting in a significant reduction in loading time. The algorithm is also designed so that a bulk-loaded ND-tree has query performance comparable to that of an incrementally constructed one. To evaluate the effectiveness of the new algorithm, it will be compared experimentally with the incremental method and other existing bulk-loading methods in terms of both loading and querying efficiency. A theoretical analysis of the bulk-loading algorithm is planned for future research. Furthermore, the bulk-loading algorithm will become an integrated part of a planned index-based bioinformatics search engine in future research.
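
The ideas in the abstract can be illustrated with a small sketch. It is not the ND-tree or the project's bulk-loading algorithm, neither of which is specified in this record. It only shows, under simplifying assumptions, (1) how q-grams are cut from a sequence and compared with Hamming distance, a common distance for non-ordered discrete vectors, and (2) how buffering vectors in memory defers disk writes. All names here (`qgrams`, `hamming`, `range_query`, `BufferedLoader`) are hypothetical, and the "disk" is simulated with an in-memory list.

```python
from typing import List


def qgrams(sequence: str, q: int) -> List[str]:
    """Return all overlapping substrings of length q from the sequence."""
    return [sequence[i:i + q] for i in range(len(sequence) - q + 1)]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def range_query(vectors: List[str], query: str, radius: int) -> List[str]:
    """Linear-scan similarity query: vectors within `radius` of `query`.

    An index such as the ND-tree exists to avoid this full scan.
    """
    return [v for v in vectors if hamming(v, query) <= radius]


class BufferedLoader:
    """Toy loader (an assumption, not the project's algorithm).

    Vectors accumulate in memory and are written out only when the
    buffer is full, so most insertions cause no simulated disk write.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffer: List[str] = []
        self.pages: List[List[str]] = []  # stands in for disk pages
        self.flush_count = 0              # number of simulated writes

    def add(self, vector: str) -> None:
        self.buffer.append(vector)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        """Write the buffered vectors as one simulated page."""
        if self.buffer:
            self.pages.append(list(self.buffer))
            self.buffer.clear()
            self.flush_count += 1


grams = qgrams("ACGTAC", 3)            # four overlapping 3-grams
near = range_query(grams, "ACT", 1)    # 3-grams within distance 1 of "ACT"

loader = BufferedLoader(capacity=3)
for g in grams:
    loader.add(g)
loader.flush()                         # write out any remaining vectors
```

With a buffer capacity of 3, loading the four 3-grams triggers two simulated writes rather than four. A larger buffer lowers the count further, which is the intuition behind the reduction in loading time that the abstract describes.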
