Development of Intelligent full text retrieval system based on data compression and fast string pattern matching algorithms
基于数据压缩和快速字符串模式匹配算法的智能全文检索系统开发
基本信息
- 批准号:13558029
- 负责人:
- 金额:$ 7.17万
- 依托单位:
- 依托单位国家:日本
- 项目类别:Grant-in-Aid for Scientific Research (B)
- 财政年份:2001
- 资助国家:日本
- 起止时间:2001 至 2003
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Suffix trees and Directed Acyclic Word Graphs(DAWGs) are well-known data structures as efficient indexingstructures for strings. We focus on Compact Directed Acyclic Word Graphs(CDAWGs) which are more compact indexing structures, and showed online construction algorithms for them. We also showed an online construction algorithm for an indexing structure consists of every DAWGs for all prefixes of given strings, and proved a lower-bound of the number of states of subsequence automata accepting all subsequences of a given string. We then introduced a new implementation technique based on ternary trees for DAWGs, which balances space efficiency and search time for a large alphabet, such as Japanese texts.We proposed an inverse problem in which we infer an original string from a given unlabelled graph corresponding to the indexing structures of the string. We showed linear-time algorithms for DAWG, subsequence automata, and suffix arrays in this setting. Moreover, we succeeded to prove a tight upper-bound of the length of solutions of world equations containing one variable.Concerning with data compression, we showed a space, efficient algorithm which outputs a compact context-free grammar representing a given string, and proved its approximation ratio. We also showed a linear-time compression algorithm using longest first replacement heuristics.In order to find patterns from large database in reasonable time, we developed several algorithms for classes of generalized patterns. Especially, we proposed an efficient pattern discovery algorithm in which we allow small mismatches of the pattern with data, and verified that it is practical by a series of computational experiments.
后缀树和有向无环词图(DAWG)是众所周知的数据结构,它们是有效的字符串索引结构。重点研究了紧凑有向无环词图(CDAWG)这一更为紧凑的索引结构,并给出了在线构造算法。我们还给出了一个由每个Dawgs组成的索引结构的在线构造算法,并证明了接受给定串的所有子序列的子序列自动机的状态数的下界。然后介绍了一种新的基于三叉树的Dawgs实现技术,该技术平衡了空间效率和搜索时间对大字母表(如日语文本)的影响,提出了一个反问题,即从给定的未标记图中推断出与字符串的索引结构相对应的原始字符串。在这种情况下,我们展示了Dawg、子序列自动机和后缀数组的线性时间算法。此外,我们成功地证明了一元世界方程的解的长度的一个紧上界,结合数据压缩,我们给出了一个空间高效的算法,该算法输出一个表示给定字符串的紧致上下文无关文法,并证明了它的逼近比。为了在合理的时间内从大型数据库中发现模式,我们提出了几种针对广义模式类的算法。特别是,我们提出了一种高效的模式发现算法,允许模式与数据的小失配,并通过一系列的计算实验验证了该算法的实用性。
项目成果
期刊论文数量(103)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "Compact Directed Acyclic Word Graphs for a Sliding Window"Lecture Notes in Computer Science. 2476(SPIRE2002). 310-324 (2002)
Shunsuke Inenaga、Ayumi Shinohara、Masayuki Takeda、Setsuo Arikawa:“滑动窗口的紧凑有向非循环字图”计算机科学讲义。
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
- 通讯作者:
Masahiro Hirao, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "A practical algorithm to find the best episode patterns"Lecture Notes in Artificial Intelligence. 2226(DS2001). 435-440 (2001)
Masahiro Hirao、Shunsuke Inenaga、Ayumi Shinohara、Masayuki Takeda、Setsuo Arikawa:“寻找最佳剧集模式的实用算法”人工智能讲义。
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
- 通讯作者:
S.Inenaga et al.: "Compact Directed Acyclic Word Graphs for a Sliding Window"Lecture Notes in Computer Science. 2476. 310-324 (2002)
S.Inenaga 等人:“滑动窗口的紧凑有向非循环字图”计算机科学讲义。
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
- 通讯作者:
Masayuki Takeda et al.: "Discovering Most Classificatory Patterns for Very Expressive Pattern Classes"Lecture Notes in Computer Science. 2843. 486-493 (2003)
Masayuki Takeda 等人:“发现极具表现力的模式类的大多数分类模式”计算机科学讲义。
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
- 通讯作者:
Takuya Kida, Tetsuya Matsumoto, Y.Shibata, Masayuki Takeda, Ayumi Shinohara, Setsuo Arikawa: "Collage system : A unifying framework for compressed pattern matching"Theoretical Computer Science. Vol.298, Isse 1. 253-272 (2003)
Takuya Kida、Tetsuya Matsumoto、Y.Shibata、Masayuki Takeda、Ayumi Shinohara、Setsuo Arikawa:“拼贴系统:压缩模式匹配的统一框架”理论计算机科学。
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
- 通讯作者:
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
SHINOHARA Ayumi其他文献
SHINOHARA Ayumi的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('SHINOHARA Ayumi', 18)}}的其他基金
Development of e-learning system for university students
大学生电子学习系统的开发
- 批准号:
25560067 - 财政年份:2013
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Challenging Exploratory Research
Development of A Research Support System for Stringology
弦学研究支持系统的开发
- 批准号:
23650002 - 财政年份:2011
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Challenging Exploratory Research
A study on knowledge discovery based on data compression
基于数据压缩的知识发现研究
- 批准号:
20300052 - 财政年份:2008
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Scientific Research (B)
Development of Intelligent Full-text Search System using Efficient Pattern Matching Algorithms on Compressed Data
利用压缩数据的高效模式匹配算法开发智能全文搜索系统
- 批准号:
10558047 - 财政年份:1998
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Scientific Research (B).
相似海外基金
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
合作研究:OAC 核心:用于科学分析和可视化的拓扑感知数据压缩
- 批准号:
2313124 - 财政年份:2023
- 资助金额:
$ 7.17万 - 项目类别:
Standard Grant
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
合作研究:OAC 核心:用于科学分析和可视化的拓扑感知数据压缩
- 批准号:
2313122 - 财政年份:2023
- 资助金额:
$ 7.17万 - 项目类别:
Standard Grant
STTR Phase I: Machine Learning-Based Smart Data Compression Solutions for Structural Health Monitoring Sensors
STTR 第一阶段:用于结构健康监测传感器的基于机器学习的智能数据压缩解决方案
- 批准号:
2321884 - 财政年份:2023
- 资助金额:
$ 7.17万 - 项目类别:
Standard Grant
Compressed learning: theory and application of data compression technique that allows direct learning from optimally encoded data
压缩学习:数据压缩技术的理论和应用,允许从最佳编码数据中直接学习
- 批准号:
23K11233 - 财政年份:2023
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Scientific Research (C)
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
合作研究:OAC 核心:用于科学分析和可视化的拓扑感知数据压缩
- 批准号:
2313123 - 财政年份:2023
- 资助金额:
$ 7.17万 - 项目类别:
Standard Grant
Learning-Oriented Data Compression with Applications
面向学习的数据压缩及其应用
- 批准号:
RGPIN-2018-06768 - 财政年份:2022
- 资助金额:
$ 7.17万 - 项目类别:
Discovery Grants Program - Individual
Data compression for biomedical data analysis
用于生物医学数据分析的数据压缩
- 批准号:
DGDND-2022-03074 - 财政年份:2022
- 资助金额:
$ 7.17万 - 项目类别:
DND/NSERC Discovery Grant Supplement
A mathematical foundation for hexagon grids with apertures of even power of 2 and its application to high data compression of trajectory data
孔径为2偶次方六边形网格的数学基础及其在轨迹数据高数据压缩中的应用
- 批准号:
22H03597 - 财政年份:2022
- 资助金额:
$ 7.17万 - 项目类别:
Grant-in-Aid for Scientific Research (B)
Efficient log data compression and analytics system
高效的日志数据压缩和分析系统
- 批准号:
570524-2021 - 财政年份:2022
- 资助金额:
$ 7.17万 - 项目类别:
Alliance Grants
Data compression for biomedical data analysis
用于生物医学数据分析的数据压缩
- 批准号:
RGPIN-2022-03074 - 财政年份:2022
- 资助金额:
$ 7.17万 - 项目类别:
Discovery Grants Program - Individual