Development of Efficient Data Mining Systems for Large Semi-Structured Text Data
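Basic Information
- Grant number: 11558040
- Principal investigator:
- Amount: $62,700
- Host institution:
- Host institution country: Japan
- Project category: Grant-in-Aid for Scientific Research (B)
- Fiscal year: 1999
- Funding country: Japan
- Duration: 1999 to 2001
- Project status: Completed
- Source:
- Keywords:
Project Summary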
The goal of this research project is to devise an efficient semi-automatic tool that supports human discovery from large unstructured and semi-structured text data. To achieve this goal, we pursued the following three directions.
1. The central process of text mining is pattern discovery. We employed the framework of optimized pattern discovery and developed efficient and robust text mining algorithms that find simple combinatorial patterns in large unstructured texts. To implement these algorithms, we developed a text index structure, based on suffix arrays, that is suitable for text mining. Based on these technologies, we implemented a prototype system and ran computer experiments on Web data.
2. Another important technology for text is efficient pattern matching. As a theoretical foundation, we proposed a unified framework, called the Collage system, for realizing various dictionary-based compression methods. We developed both Knuth-Morris-Pratt type and Boyer-Moore type pattern matching algorithms on top of this framework. We also applied the framework to the Byte-Pair-Encoding compression method and to Sequitur; the former yields the fastest compressed pattern matching algorithm.
3. The final process of text mining is information extraction. From a theoretical point of view, we first formalized the information extraction problem for semi-structured data and then gave a theoretical analysis of the power and limitations of such tasks. We then developed efficient information extraction algorithms for various types of extraction rules, including tree wrappers and hedge patterns, and evaluated them through experiments on real-life semi-structured data on the Internet.
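To make the idea of a suffix-array-based text index concrete, the following is a minimal sketch in Python. It is not the project's implementation, which the summary does not describe in detail. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is an assumption chosen for brevity rather than efficiency. It shows how a sorted array of suffix positions lets all occurrences of a pattern be located with two binary searches.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction (sorting suffixes directly); fine for illustration,
    but much slower than the specialized algorithms used for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of every occurrence of `pattern`.

    Suffixes that begin with `pattern` form one contiguous block in the
    suffix array, so two binary searches find the block boundaries.
    """
    m = len(pattern)

    # First index whose length-m suffix prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First index whose length-m suffix prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "abra"))
```

Because the number of occurrences is simply the size of the matching block, an index of this kind can report pattern frequencies without rescanning the text, which is the sort of query a pattern discovery procedure issues repeatedly.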
Project Outcomes
Journal papers (104)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
M. Takeda et al.: "Mining from Literary Texts: Pattern Discovery and Similarity Computation" Lecture Notes in Computer Science. 2281. 520-533 (2002)
H. Arimura et al.: "Efficient Learning of Semi-Structured Data from Queries" Lecture Notes in Artificial Intelligence. 2225. 315-331 (2001)
K. Tamari et al.: "Discovering Poetic Allusion in Anthologies of Classical Japanese Poems" Proc. 2nd Int. Conf. on Discovery Science. LNAI 1721. 128-138 (1999)
H. Hori et al.: "Fragmentary Pattern Matching: Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Proc. 12th Annual International Symposium on Algorithms and Computation (ISAAC'01). 719-730 (2001)
Tetsuya Nasukawa et al.: "Base Technology for Text Mining" Journal of Japanese Society for Artificial Intelligence. 16(2). 201-211 (2001)
Other publications by ARIMURA Hiroki
The Complexity of Induced Tree Reconfiguration Problems
- DOI: 10.1587/transinf.2018fcp0010
- Publication year: 2019
- Journal:
- Impact factor: 0.7
- Authors: WASA Kunihiro; YAMANAKA Katsuhisa; ARIMURA Hiroki
- Corresponding author: ARIMURA Hiroki
Other grants by ARIMURA Hiroki
Next-Generation Semi-structured Data Mining for Large-Scale Knowledge Base Formation
- Grant number: 20240014
- Fiscal year: 2008
- Funding amount: $62,700
- Project category: Grant-in-Aid for Scientific Research (A)
Similar Overseas Grants
Development of Next-generation Semi-Structured Data Mining Technology Towards The Real-World Knowledge Creation Infrastructure
- Grant number: 20H00595
- Fiscal year: 2020
- Funding amount: $62,700
- Project category: Grant-in-Aid for Scientific Research (A)
Automata for Semi-Structured Data
- Grant number: 441893214
- Fiscal year: 2020
- Funding amount: $62,700
- Project category: Heisenberg Grants
CAREER: Transducer-Centric Parallelization for Scalable Semi-Structured Data Processing
- Grant number: 1751392
- Fiscal year: 2018
- Funding amount: $62,700
- Project category: Continuing Grant
Coalgebraic Foundations of Semi-Structured Data
- Grant number: EP/N015843/1
- Fiscal year: 2016
- Funding amount: $62,700
- Project category: Research Grant
Next-generation semi-structured data mining technologies for real-world knowledge infrastructures
- Grant number: 16H01743
- Fiscal year: 2016
- Funding amount: $62,700
- Project category: Grant-in-Aid for Scientific Research (A)
Automata for Semi-Structured Data
- Grant number: 270792973
- Fiscal year: 2016
- Funding amount: $62,700
- Project category: Heisenberg Professorships
Distributional Semantics Over Semi-structured Data
- Grant number: 488743-2015
- Fiscal year: 2015
- Funding amount: $62,700
- Project category: Engage Grants Program
Estimating data structure embedded in semi-structured data
- Grant number: 24300054
- Fiscal year: 2012
- Funding amount: $62,700
- Project category: Grant-in-Aid for Scientific Research (B)
CAREER: Analyzing and Exploiting Meta-information for Keyword Search on Semi-structured Data
- Grant number: 1322406
- Fiscal year: 2012
- Funding amount: $62,700
- Project category: Continuing Grant
Development of Next-Generation Semi-structured Data Mining for Large-Scale Knowledge Base Formation
- Grant number: 24240021
- Fiscal year: 2012
- Funding amount: $62,700
- Project category: Grant-in-Aid for Scientific Research (A)