AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing
AF:小:重复感知文本压缩和索引的理论方面
基本信息
- 批准号:2112643
- 负责人:
- 金额:$ 44.98万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2021
- 资助国家:美国
- 起止时间:2021-10-01 至 2023-04-30
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Being able to store, search, and analyze massive data sets efficiently is one of today's pressing challenges. To that end, this project will study a collection of problems under text compression and indexing with tremendous current relevance, owing to a specific characteristic prevalent in many modern text data sets, called high repetitiveness. This characteristic makes the data highly compressible using some specialized schemes. However, the theoretical understanding of those schemes is still in a nascent stage. This project will address some of the fundamental open problems on the effectiveness of several schemes that are popular in practice. This project will also introduce new ideas for indexing such data in a highly space-efficient manner and quickly support various (application-specific) queries. The techniques developed in this project will apply to a broad class of algorithmic problems and applications; therefore, they will have a lasting impact on related fields. The main results stem from this research will be disseminated through major conferences, journals, and workshop tutorials. The participation of undergraduates and minorities, including women, will be ensured. Suffix trees and suffix arrays are two fundamental data structures for text indexing with many applications in bioinformatics. However, they are notorious for their space complexity. The FM-index index encodes suffix arrays in space close to the entropy-based lower bound using the Burrows-Wheeler Transformation (BWT). But, the entropy model of compression is less effective in capturing repetitiveness. Therefore, modern applications urge even more space frugal encodings that exploit repetitiveness. Although the community has made some recent progress in exact pattern matching, solutions to many other (more complex) matching problems are still open. To that end, this project will attempt to design repetition-aware indexes for pattern matching under mismatches, edits, and wildcards, along with efficient construction algorithms. This project also aims to develop a unified framework to compress several advanced suffix-tree variants (quasi-suffix trees) that support even more complex matching paradigms such as parameterized and order-isomorphic matching. Novel techniques and concepts that go beyond the existing BWT-based methods will be sought. Additionally, the project will attempt a deeper study on various aspects of BWT-runs and the relative Lempel-Ziv compression scheme from the hardness side.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
能够有效地存储、搜索和分析海量数据集是当今的紧迫挑战之一。为此目的,本项目将研究文本压缩和索引方面的一系列问题,这些问题与当前的重大相关性,因为许多现代文本数据集中普遍存在一个特定的特点,称为高度重复性。这个特性使得数据使用一些专门的方案高度可压缩。然而,对这些方案的理论理解仍处于起步阶段。该项目将解决一些基本的开放性问题,在实践中流行的几个计划的有效性。该项目还将引入以高度空间效率的方式为此类数据编制索引的新思路,并快速支持各种(特定于应用程序的)查询。在这个项目中开发的技术将适用于广泛的算法问题和应用,因此,它们将对相关领域产生持久的影响。这项研究的主要成果将通过主要会议、期刊和研讨会教程传播。将确保大学生和包括妇女在内的少数民族的参与。后缀树和后缀数组是文本索引的两种基本数据结构,在生物信息学中有着广泛的应用。然而,它们因其空间复杂性而臭名昭着。FM索引使用Burrows-Wheeler变换(BWT)在接近基于熵的下限的空间中对后缀数组进行编码。但是,压缩的熵模型在捕获重复性方面不太有效。因此,现代应用程序甚至敦促更多的节省空间的编码,利用重复性。虽然社区最近在精确模式匹配方面取得了一些进展,但许多其他(更复杂)匹配问题的解决方案仍然是开放的。为此,本项目将尝试设计重复感知索引,用于在不匹配、编辑和通配符下进行模式匹配,并使用沿着的高效构造算法。该项目还旨在开发一个统一的框架来压缩几个高级后缀树变体(准后缀树),这些变体支持更复杂的匹配范例,例如参数化和顺序同构匹配。将寻求超越现有基于BWT的方法的新技术和概念。此外,该项目将尝试从硬度方面对BWT运行的各个方面以及相对的Lempel-Ziv压缩方案进行更深入的研究。该奖项反映了NSF的法定使命,并通过使用基金会的知识价值和更广泛的影响审查标准进行评估,被认为值得支持。
项目成果
期刊论文数量(5)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)
高级模式匹配问题的紧凑文本索引:参数化、顺序同构、2D 等(特邀演讲)
- DOI:10.4230/lipics.cpm.2022.3
- 发表时间:2022
- 期刊:
- 影响因子:0
- 作者:Thankachan, Sharma V.
- 通讯作者:Thankachan, Sharma V.
The Heaviest Induced Ancestors Problem: Better Data Structures and Applications
最严重的诱发祖先问题:更好的数据结构和应用程序
- DOI:10.1007/s00453-022-00955-7
- 发表时间:2022
- 期刊:
- 影响因子:1.1
- 作者:Abedin, Paniz;Hooshmand, Sahar;Ganguly, Arnab;Thankachan, Sharma V.
- 通讯作者:Thankachan, Sharma V.
On the Complexity of Recognizing Wheeler Graphs
论识别惠勒图的复杂性
- DOI:10.1007/s00453-021-00917-5
- 发表时间:2022
- 期刊:
- 影响因子:1.1
- 作者:Gibney, Daniel;Thankachan, Sharma V.
- 通讯作者:Thankachan, Sharma V.
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Sharma Thankachan其他文献
Sharma Thankachan的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Sharma Thankachan', 18)}}的其他基金
REU Site: Algorithm Design --- Theory and Engineering
REU网站:算法设计---理论与工程
- 批准号:
2349179 - 财政年份:2024
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing
AF:小:重复感知文本压缩和索引的理论方面
- 批准号:
2315822 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
CAREER: Algorithmic Aspects of Pan-genomic Data Modeling, Indexing and Querying
职业:泛基因组数据建模、索引和查询的算法方面
- 批准号:
2316691 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Continuing Grant
CAREER: Algorithmic Aspects of Pan-genomic Data Modeling, Indexing and Querying
职业:泛基因组数据建模、索引和查询的算法方面
- 批准号:
2146003 - 财政年份:2022
- 资助金额:
$ 44.98万 - 项目类别:
Continuing Grant
NSF Student Travel Grant for Workshop on String Algorithms in Bioinformatics (StringBio), 2019
NSF 学生生物信息学字符串算法研讨会 (StringBio) 旅行补助金,2019
- 批准号:
1946289 - 财政年份:2019
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
NSF Student Travel Grant for 2018 International Workshop on String Algorithms in Bioinformatics (StringBio)
NSF 学生旅费资助 2018 年生物信息学字符串算法国际研讨会 (StringBio)
- 批准号:
1849136 - 财政年份:2018
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
AF: Medium: Collaborative Research: Sequential and Parallel Algorithms for Approximate Sequence Matching with Applications to Computational Biology
AF:媒介:协作研究:近似序列匹配的顺序和并行算法及其在计算生物学中的应用
- 批准号:
1703489 - 财政年份:2017
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
相似国自然基金
昼夜节律性small RNA在血斑形成时间推断中的法医学应用研究
- 批准号:
- 批准年份:2024
- 资助金额:0.0 万元
- 项目类别:省市级项目
tRNA-derived small RNA上调YBX1/CCL5通路参与硼替佐米诱导慢性疼痛的机制研究
- 批准号:n/a
- 批准年份:2022
- 资助金额:10.0 万元
- 项目类别:省市级项目
Small RNA调控I-F型CRISPR-Cas适应性免疫性的应答及分子机制
- 批准号:32000033
- 批准年份:2020
- 资助金额:24.0 万元
- 项目类别:青年科学基金项目
Small RNAs调控解淀粉芽胞杆菌FZB42生防功能的机制研究
- 批准号:31972324
- 批准年份:2019
- 资助金额:58.0 万元
- 项目类别:面上项目
变异链球菌small RNAs连接LuxS密度感应与生物膜形成的机制研究
- 批准号:81900988
- 批准年份:2019
- 资助金额:21.0 万元
- 项目类别:青年科学基金项目
基于small RNA 测序技术解析鸽分泌鸽乳的分子机制
- 批准号:31802058
- 批准年份:2018
- 资助金额:26.0 万元
- 项目类别:青年科学基金项目
肠道细菌关键small RNAs在克罗恩病发生发展中的功能和作用机制
- 批准号:31870821
- 批准年份:2018
- 资助金额:56.0 万元
- 项目类别:面上项目
Small RNA介导的DNA甲基化调控的水稻草矮病毒致病机制
- 批准号:31772128
- 批准年份:2017
- 资助金额:60.0 万元
- 项目类别:面上项目
基于small RNA-seq的针灸治疗桥本甲状腺炎的免疫调控机制研究
- 批准号:81704176
- 批准年份:2017
- 资助金额:20.0 万元
- 项目类别:青年科学基金项目
水稻OsSGS3与OsHEN1调控small RNAs合成及其对抗病性的调节
- 批准号:91640114
- 批准年份:2016
- 资助金额:85.0 万元
- 项目类别:重大研究计划
相似海外基金
Collaborative Research: FET: Small: Theoretical Foundations of Quantum Pseudorandom Primitives
合作研究:FET:小型:量子伪随机原语的理论基础
- 批准号:
2329938 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
CIF: SMALL: Theoretical Foundations of Partially Observable Reinforcement Learning: Minimax Sample Complexity and Provably Efficient Algorithms
CIF:SMALL:部分可观察强化学习的理论基础:最小最大样本复杂性和可证明有效的算法
- 批准号:
2315725 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing
AF:小:重复感知文本压缩和索引的理论方面
- 批准号:
2315822 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
NSF-BSF: Collaborative Research: CIF: Small: Neural Estimation of Statistical Divergences: Theoretical Foundations and Applications to Communication Systems
NSF-BSF:协作研究:CIF:小型:统计差异的神经估计:通信系统的理论基础和应用
- 批准号:
2308445 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
Collaborative Research: FET: Small: Theoretical Foundations of Quantum Pseudorandom Primitives
合作研究:FET:小型:量子伪随机原语的理论基础
- 批准号:
2329939 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
NSF-BSF: Collaborative Research: CIF: Small: Neural Estimation of Statistical Divergences: Theoretical Foundations and Applications to Communication Systems
NSF-BSF:协作研究:CIF:小型:统计差异的神经估计:通信系统的理论基础和应用
- 批准号:
2308446 - 财政年份:2023
- 资助金额:
$ 44.98万 - 项目类别:
Standard Grant
Investigating the Dark Sector with Small-Scale Cosmology: Theoretical Implications for Substructures and Their Observables
用小尺度宇宙学研究暗区:子结构及其可观测值的理论意义
- 批准号:
570282-2022 - 财政年份:2022
- 资助金额:
$ 44.98万 - 项目类别:
Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Experimental and theoretical study of hydrodynamics of solid particle transport in small scale
小尺度固体颗粒输运的流体动力学实验与理论研究
- 批准号:
RGPIN-2017-05272 - 财政年份:2022
- 资助金额:
$ 44.98万 - 项目类别:
Discovery Grants Program - Individual
A Theoretical Model of Radiation Mechanism of Small Antenna for Improving Reliability of IoT Devices
提高物联网设备可靠性的小天线辐射机制理论模型
- 批准号:
21K14158 - 财政年份:2021
- 资助金额:
$ 44.98万 - 项目类别:
Grant-in-Aid for Early-Career Scientists
Theoretical, Positive and Institutional Research in Small M&A and Family Business Succession
小M的理论、实证和制度研究
- 批准号:
21K01727 - 财政年份:2021
- 资助金额:
$ 44.98万 - 项目类别:
Grant-in-Aid for Scientific Research (C)