III: Small: Collaborative Research: Supporting Efficient Discrete Box Queries for Sequence Analysis on Large Scale Genome Databases

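Basic Information

Award number:
1319909
Principal investigator:
Sakti Pramanik
Amount:
$273,400
Institution:
Michigan State University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2013
Funding country:
United States
Project period:
2013-09-01 to 2018-08-31
Project status:
Completed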

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1319909&HistoricalAwards=false
Keywords:
III Small Collaborative Research Supporting

Project Abstract

This collaborative research project, conducted jointly by investigators from Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates issues and techniques for storing and searching large-scale k-mer data sets (i.e., overlapping length-k subsequences obtained from genome sequences) for sequence analysis in bioinformatics. Efficient k-mer indexing, storage, and retrieval are vital to sequence analysis tasks such as error correction, and they matter more as sequencing data sets grow. Most existing methods for storing and searching k-mers are optimized for exact or range queries, and this reliance limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods do not support efficient storage of k-mers at multiple word lengths. For many sequence analysis problems, including error correction, variant detection, and assembly, searches with multiple word lengths enable better sensitivity and specificity. This project investigates techniques for efficiently supporting so-called (discrete) box queries and related queries (e.g., hybrid queries) on large-scale k-mer data sets for sequence analysis. It examines approaches to optimizing box queries for sequence analysis problems such as error correction, and it explores storage structures and the use of box queries for supporting searches with multiple word lengths on k-mer data sets. The results of this research will advance the state of knowledge in storage, indexing, and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making new, efficient on-disk solutions available for sequence analysis. They will also impact other application areas, including biometrics, image processing, social networks, and e-commerce, where processing non-ordered discrete multidimensional data is crucial.
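To make the two central terms concrete, here is a minimal Python sketch, not the project's implementation. It assumes the usual meaning of a box query in a non-ordered discrete data space: the query gives one set of allowed letters per position, and a k-mer matches if every letter falls in the set for its position. The function names `kmers` and `box_query` and the sample reads are hypothetical, and the linear scan stands in for a real index.

```python
def kmers(sequence, k):
    """Return all overlapping length-k substrings of sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def box_query(kmer_set, box):
    """Return stored k-mers whose letter at each position is in that position's allowed set.

    `box` is a list with one set of allowed letters per position.
    This is a plain linear scan, used only to show what the query means.
    """
    return sorted(
        km for km in kmer_set
        if len(km) == len(box)
        and all(ch in allowed for ch, allowed in zip(km, box))
    )


# Hypothetical reads, chosen only for illustration.
reads = ["ACGTAC", "ACGAAC"]
stored = {km for read in reads for km in kmers(read, 4)}

# Fix the first three positions and leave the last one free.
box = [{"A"}, {"C"}, {"G"}, {"A", "C", "G", "T"}]
print(box_query(stored, box))  # ['ACGA', 'ACGT']
```

An exact query is the special case where every set holds a single letter, which is why an index tuned only for exact or range queries does not cover this query type directly.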
This collaborative research project, conducted jointly by investigators from Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates issues and techniques for storing and searching large-scale k-mer data sets for sequence analysis in bioinformatics. Efficient k-mer indexing, storage, and retrieval are vital to sequence analysis tasks such as error correction, and they matter more as sequencing data sets grow. Most existing methods for storing and searching k-mers are optimized for exact or range queries, and this reliance limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods do not support efficient storage of k-mers at multiple word lengths. For many sequence analysis problems, searches with multiple word lengths enable better sensitivity and specificity. This project investigates techniques for efficiently supporting so-called (discrete) box queries and related queries (e.g., hybrid queries) on large-scale k-mer data sets for sequence analysis. In particular, a new index tree, named the BoND-tree, is developed specifically for the non-ordered discrete data space that characterizes k-mer data sets. The unique properties of this space are exploited to develop new node-splitting heuristics for the index tree, and theoretical analysis is performed to show the optimality of the proposed heuristics. Besides the BoND-tree, which is based on data partitioning, index schemes based on space partitioning are also developed for box queries in such a space. To support a more flexible type of query (i.e., hybrid box and range queries), hybrid index schemes that integrate the strengths of box query indexes and range query indexes are studied. To make index construction efficient for large-scale k-mer data sets, bulk-loading techniques are also developed for the proposed index trees.
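The abstract does not describe the BoND-tree's node layout or its splitting heuristics, so the following Python sketch illustrates only the general idea behind a data-partitioning index for box queries. It assumes each group of k-mers is summarized by a per-position set of letters (a discrete bounding box), and that a group can be skipped when any position of its box shares no letter with the query. The names `bounding_box`, `may_contain_match`, and `search`, and the two sample groups, are hypothetical.

```python
def bounding_box(kmers):
    """Per-position set of letters that occur among the given equal-length k-mers."""
    return [set(column) for column in zip(*kmers)]


def may_contain_match(node_box, query_box):
    """True if the node's box and the query box share a letter at every position."""
    return all(node_set & query_set
               for node_set, query_set in zip(node_box, query_box))


def search(groups, query_box):
    """Scan only groups whose bounding box overlaps the query.

    Returns (sorted matches, number of groups actually scanned).
    """
    matches, scanned = [], 0
    for group in groups:
        if not may_contain_match(bounding_box(group), query_box):
            continue  # pruned: no k-mer in this group can satisfy the query
        scanned += 1
        matches.extend(
            km for km in group
            if all(ch in allowed for ch, allowed in zip(km, query_box))
        )
    return sorted(matches), scanned


# Two hypothetical leaf groups; how to form good groups is the hard part
# (the splitting heuristics) and is deliberately not modelled here.
leaf_a = ["ACGT", "ACGA", "ACCT"]
leaf_b = ["TTGA", "TTCA", "TGGA"]
query = [{"A"}, {"C"}, {"G"}, {"A", "T"}]

print(search([leaf_a, leaf_b], query))  # (['ACGA', 'ACGT'], 1)
```

In this toy case the second group is pruned because none of its k-mers starts with `A`. How often such pruning succeeds depends on how the k-mers are grouped, which is the role of the node-splitting heuristics the abstract mentions.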
In addition, approaches to optimizing box queries for sequence analysis problems such as error correction are examined. Storage structures and the use of box queries for supporting searches with multiple word lengths on k-mer data sets are also explored. The research is expected to yield fundamental properties of the data space for sequence data in bioinformatics, a number of novel storage, indexing, and retrieval techniques that exploit those properties, and applications of the proposed techniques to important problems in sequence analysis. These results will advance the state of knowledge in storage, indexing, and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making new, efficient on-disk solutions available for sequence analysis. They will also impact other application areas, including biometrics, image processing, social networks, and e-commerce, where processing non-ordered discrete multidimensional data is crucial.
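The abstract does not say how box queries are applied to error correction. One plausible reading, sketched below in Python under that assumption, is that a suspect k-mer is compared against frequently observed k-mers that differ from it at a single position, and that each such lookup is a box query with one position left free. The count table, the `min_count` threshold, and the function names are all hypothetical.

```python
ALPHABET = frozenset("ACGT")


def one_mismatch_boxes(kmer):
    """Yield one box per position: that position is left free, all others are fixed."""
    for i in range(len(kmer)):
        yield [set(ALPHABET) if j == i else {ch} for j, ch in enumerate(kmer)]


def correction_candidates(kmer, counts, min_count=2):
    """Return k-mers seen at least min_count times that differ from kmer at one position."""
    found = set()
    for box in one_mismatch_boxes(kmer):
        for stored, count in counts.items():
            if (stored != kmer
                    and count >= min_count
                    and len(stored) == len(box)
                    and all(ch in allowed for ch, allowed in zip(stored, box))):
                found.add(stored)
    return sorted(found)


# Hypothetical k-mer counts; "ACGA" is seen once and treated as a possible error.
counts = {"ACGT": 9, "ACGA": 1, "ACCT": 7, "TCGT": 5, "TTGA": 8}
print(correction_candidates("ACGA", counts))  # ['ACGT']
```

The inner loop scans every stored k-mer for clarity. With an index that supports box queries, each of the k boxes would become a single index lookup.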
